With the tools ready, it's time to encapsulate the Python crawler

Last Update:2017-11-10 Source: Internet

Author: User


By now the basics of writing a crawler in Python should be familiar. Once we are comfortable with them, the natural next step is to encapsulate the crawler in reusable classes.

We start by encapsulating a parent crawler class.

My design is as follows. The crawler needs a field, gainrule, that stores the matching rule (a CSS selector), and a field, outattr, that stores which attribute to extract from the matched element.

It also needs gainlist, the list of pages to process, outlist, which stores the output when the result is a list, and outdata, which stores the output when the result is a single value.

The crawler's parent class is therefore defined as follows:

    from bs4 import BeautifulSoup
    import requests
    import re


    class SpiderHp:
        # gainrule: page parsing rule; outattr: attribute to extract;
        # gainlist: list of pages to parse
        def __init__(self, gainrule, outattr=None, gainlist=None):
            self.headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"}
            self.gainrule = gainrule
            self.outattr = outattr
            self.gainlist = gainlist
            self.req = requests.Session()
            self.outlist = []
            self.outdata = ""

        # Process a list of pages
        def startall(self, gainlist=None):
            if gainlist:
                self.gainlist = gainlist
            for url in self.gainlist:
                self.initurllist(url)

        # Process a single page
        def start(self, gaindata):
            self.initurllist(gaindata)

With the basic functionality in place, we can define our own specialized crawlers. The parent class calls initurllist but does not define it, so each subclass has to supply that method.

For example, we often need an ordinary crawler that fetches a single page and extracts a single value from it. For that, we write a crawler that inherits from the parent class:
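The contract between the parent class and its subclasses can be sketched without any network access. This is only an illustration, not the article's code: `BaseSpider` and `RecordingSpider` are hypothetical names, and the explicit `NotImplementedError` is an added assumption about how one might make the missing method obvious.

```python
class BaseSpider:
    """Hypothetical stand-in for the parent crawler, without any network code."""

    def __init__(self, gainlist=None):
        self.gainlist = gainlist or []
        self.outlist = []

    def startall(self, gainlist=None):
        # Optionally replace the stored URL list, then process every URL.
        if gainlist:
            self.gainlist = gainlist
        for url in self.gainlist:
            self.initurllist(url)

    def start(self, url):
        # Process a single URL.
        self.initurllist(url)

    def initurllist(self, url):
        # Hook that every subclass is expected to override.
        raise NotImplementedError("subclasses must implement initurllist")


class RecordingSpider(BaseSpider):
    """Hypothetical subclass: records each URL instead of fetching it."""

    def initurllist(self, url):
        self.outlist.append(url)


spider = RecordingSpider()
spider.startall(["page1.html", "page2.html"])
spider.start("page3.html")
print(spider.outlist)
```

Each concrete crawler then only has to decide what to do with one URL; looping over the list stays in the parent class.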

    # Single-page data crawler
    class SpiderSigDataHp(SpiderHp):
        def initurllist(self, url):
            reqdata = self.req.get(url, headers=self.headers)
            soup = BeautifulSoup(reqdata.text, "lxml")
            nodelist = soup.select(self.gainrule)
            if nodelist:
                if self.outattr:
                    self.outdata = nodelist[0].get(self.outattr)
                else:
                    self.outdata = nodelist[0]

A crawler like this is typically used to grab things such as the total number of pages.

Then we define a crawler that specifically handles list pages.

    # Generic list-page crawler
    class SpiderListHp(SpiderHp):
        def initurllist(self, url):
            reqdata = self.req.get(url, headers=self.headers)
            soup = BeautifulSoup(reqdata.text, "lxml")
            nodelist = soup.select(self.gainrule)
            for node in nodelist:
                if self.outattr:
                    data = node.get(self.outattr)
                else:
                    data = node
                if data not in self.outlist:
                    self.outlist.append(data)
            if not nodelist:
                print("NodeList Err", url)

Finally, we define a crawler for detail pages:
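The list crawler skips an item when it is already in the output list, which means scanning the whole list for every new item. For long lists, one possible variation keeps a set of items already seen. This is a sketch rather than the article's code: `dedupe_keep_order` is a hypothetical helper, and it assumes the items are hashable, such as the `href` strings.

```python
def dedupe_keep_order(items):
    """Return items without duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


# Made-up sample links for illustration.
links = ["/a.html", "/b.html", "/a.html", "/c.html", "/b.html"]
unique_links = dedupe_keep_order(links)
print(unique_links)
```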

    # Detail-page crawler
    class SpiderDetailHp(SpiderHp):
        def initurllist(self, url):
            reqdata = self.req.get(url, headers=self.headers)
            soup = BeautifulSoup(reqdata.text, "lxml")
            data = {}
            for key in self.gainrule:
                ps = soup.select(self.gainrule[key])
                if ps:
                    if self.outattr[key]:
                        data[key] = ps[0].get(self.outattr[key])
                    else:
                        data[key] = ps[0]
                        text = repr(data[key])
                        # Strip the tags; the markup around the value is usually of no use.
                        data[key] = re.sub("<.+?>", "", text)
            self.outlist.append(data)

That completes our crawlers. If you have other special needs, you can define further subclasses yourself.

Used in combination, these three kinds of crawler can handle most web pages. Here is a quick example of using them:
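The tag-stripping step used in the detail crawler can be tried on its own with plain strings. The sample markup below is made up for illustration, and `strip_tags` is a hypothetical wrapper around the same regular expression.

```python
import re


def strip_tags(markup):
    """Remove anything that looks like an HTML tag, using the same pattern as above."""
    return re.sub("<.+?>", "", markup)


page_info = "<strong>12</strong>"
title = '<a href="/x.html">Page <b>3</b></a>'

page_count = int(strip_tags(page_info))
print(page_count)
print(strip_tags(title))
```

This pattern is a quick heuristic: it does not match across line breaks and can be confused by a `>` inside an attribute value. For arbitrary HTML, BeautifulSoup's `get_text()` is usually the more robust choice.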

    import Spider  # the module containing the crawler classes defined above
    import re

    home = "http://www.xxxxxxx.net/"  # I'm not going to tell you what I'm crawling.


    def main():
        url = home + "hmh/list_6_1.html"
        num = getpage(url)  # get the number of pages
        # List pages are numbered from 1 up to num
        urls = [home + "hmh/list_6_{}.html".format(i) for i in range(1, num + 1)]
        hlist = getlist(urls)
        for i in range(len(hlist)):
            hlist[i] = home + hlist[i]
            print(hlist[i])
        imglist = getdetail(hlist)
        print(imglist)
        print(len(imglist))


    # Get the number of pages from the pager
    def getpage(url):
        gainrule = "span.pageinfo > strong"
        mgr = Spider.SpiderSigDataHp(gainrule)
        mgr.start(url)
        text = repr(mgr.outdata)
        # Remove all tags, leaving only the number
        num = int(re.sub("<.+?>", "", text))
        return num


    # Get the detail-page links from the list pages
    def getlist(urls):
        gainrule = "ul.piclist > li > a"
        outattr = "href"
        mgr = Spider.SpiderListHp(gainrule, outattr)
        mgr.startall(urls)
        return mgr.outlist


    # Get the detail-page information
    def getdetail(urls):
        gaindata = {}
        outattr = {}
        gaindata["image"] = "#imgshow > img"
        gaindata["page"] = "li.thisclass > a"
        outattr["image"] = "src"
        outattr["page"] = ""
        mgr = Spider.SpiderDetailHp(gaindata, outattr)
        mgr.startall(urls)
        return mgr.outlist


    if __name__ == "__main__":
        main()
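Two small steps in the script are easy to get wrong: generating one URL per list page, and turning the extracted `href` values into absolute URLs. The following is a hedged sketch using only the standard library. The host `example.invalid`, the page count, and the sample `href` values are placeholders, and it assumes the list pages are numbered from 1 up to the page count. Using `urljoin` instead of string concatenation is a swapped-in technique, not what the article does.

```python
from urllib.parse import urljoin

home = "http://example.invalid/"  # placeholder host, not the site from the article
num = 3  # pretend the pager reported three pages

# One URL per list page, numbered 1..num inclusive.
page_urls = [urljoin(home, "hmh/list_6_{}.html".format(i)) for i in range(1, num + 1)]

# urljoin copes with hrefs both with and without a leading slash.
hrefs = ["/hmh/101.html", "hmh/102.html"]
detail_urls = [urljoin(home, href) for href in hrefs]

print(page_urls)
print(detail_urls)
```

Plain concatenation such as `home + href` produces a double slash when the `href` already starts with `/`, which is why `urljoin` may be the safer option here.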

All right, that's it. The last step is to combine this with downloading the files and saving the results to a database.
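The article does not show the database step. As one possible sketch, assuming SQLite through the standard `sqlite3` module, rows shaped like the detail crawler's output could be stored as follows. The table name, column names, `save_details` helper, and sample rows are all hypothetical, and the download step is left out.

```python
import sqlite3


def save_details(conn, rows):
    """Store rows (dicts with 'image' and 'page' keys) in a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS details (image TEXT PRIMARY KEY, page TEXT)"
    )
    # INSERT OR IGNORE skips rows whose image URL is already stored.
    conn.executemany(
        "INSERT OR IGNORE INTO details (image, page) VALUES (:image, :page)",
        rows,
    )
    conn.commit()


# Made-up rows in the same shape as the detail crawler's output list.
rows = [
    {"image": "http://example.invalid/img/1.jpg", "page": "1"},
    {"image": "http://example.invalid/img/2.jpg", "page": "2"},
    {"image": "http://example.invalid/img/1.jpg", "page": "1"},  # duplicate
]

conn = sqlite3.connect(":memory:")  # in-memory database for the example
save_details(conn, rows)
count = conn.execute("SELECT COUNT(*) FROM details").fetchone()[0]
print(count)
```

Rows that are missing either key, which can happen when a selector matches nothing, would need to be filtered or filled in before being passed to this helper.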


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
