1. Introduction
The crawler consists of two classes: Crawler, which manages the entire crawling process, and Retriever, which downloads and parses each individual web page. The program targets Python 2; it relies on the htmllib, formatter, urlparse, and cStringIO modules, which were removed or reorganized in Python 3.
2. The Program
#!/usr/bin/env python

from sys import argv
from os import makedirs, unlink, sep
from os.path import dirname, exists, isdir, splitext
from string import replace, find, lower
from htmllib import HTMLParser
from urllib import urlretrieve
from urlparse import urlparse, urljoin
from formatter import DumbWriter, AbstractFormatter
from cStringIO import StringIO

class Retriever(object):                # download Web pages

    def __init__(self, url):
        self.url = url
        self.file = self.filename(url)

    def filename(self, url, deffile='index.htm'):
        parsedurl = urlparse(url, 'http:', 0)   # parse path
        path = parsedurl[1] + parsedurl[2]
        ext = splitext(path)
        if ext[1] == '':                # no file, use default
            if path[-1] == '/':
                path += deffile
            else:
                path += '/' + deffile
        ldir = dirname(path)            # local directory
        if sep != '/':                  # os-indep. path separator
            ldir = replace(ldir, '/', sep)
        if not isdir(ldir):             # create archive dir if nec.
            if exists(ldir):
                unlink(ldir)
            makedirs(ldir)
        return path

    def download(self):                 # download Web page
        try:
            retval = urlretrieve(self.url, self.file)
        except IOError:
            retval = ('*** ERROR: invalid URL "%s"' % self.url,)
        return retval

    def parseAndGetLinks(self):         # parse HTML, save links
        self.parser = HTMLParser(AbstractFormatter(
            DumbWriter(StringIO())))
        self.parser.feed(open(self.file).read())
        self.parser.close()
        return self.parser.anchorlist

class Crawler(object):                  # manage entire crawling process

    count = 0                           # static downloaded page counter

    def __init__(self, url):
        self.q = [url]
        self.seen = []                  # URLs already processed
        self.dom = urlparse(url)[1]     # restrict crawl to this domain

    def getPage(self, url):
        r = Retriever(url)
        retval = r.download()
        if retval[0][0] == '*':         # error situation, do not parse
            print retval, '... skipping parse'
            return
        Crawler.count += 1
        print '\n(', Crawler.count, ')'
        print 'URL:', url
        print 'FILE:', retval[0]
        self.seen.append(url)

        links = r.parseAndGetLinks()    # get and process links
        for eachLink in links:
            if eachLink[:4] != 'http' and \
                    find(eachLink, '://') == -1:
                eachLink = urljoin(url, eachLink)
            print '* ', eachLink,

            if find(lower(eachLink), 'mailto:') != -1:
                print '... discarded, mailto link'
                continue

            if eachLink not in self.seen:
                if find(eachLink, self.dom) == -1:
                    print '... discarded, not in domain'
                else:
                    if eachLink not in self.q:
                        self.q.append(eachLink)
                        print '... new, added to Q'
                    else:
                        print '... discarded, already in Q'
            else:
                print '... discarded, already processed'

    def go(self):                       # process links in queue
        while self.q:
            url = self.q.pop()
            self.getPage(url)

def main():
    if len(argv) > 1:
        url = argv[1]
    else:
        try:
            url = raw_input('Enter starting URL: ')
        except (KeyboardInterrupt, EOFError):
            url = ''
    if not url:
        return
    robot = Crawler(url)
    robot.go()

if __name__ == '__main__':
    main()
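
To try the crawler, save the program as, say, crawl.py (the filename is arbitrary) and pass a starting URL on the command line, or enter one at the prompt:

    $ python2 crawl.py http://www.example.com/

Downloaded pages are saved under a local directory tree that mirrors the site's path structure, and only links within the starting domain are queued for crawling; everything else (mailto: links, off-domain URLs, pages already seen or already queued) is reported and discarded.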
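
The imports above (htmllib, formatter, urlparse, cStringIO) exist only in Python 2. For readers on Python 3, here is a minimal sketch of just the link-extraction step, using the standard library's html.parser and urllib.parse; the LinkParser class and parse_and_get_links function are illustrative names introduced here, not part of the original program:

    # Python 3 sketch of Retriever.parseAndGetLinks, assuming the page
    # has already been downloaded to a local file
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkParser(HTMLParser):
        # collect the href of every <a> tag, like htmllib's anchorlist
        def __init__(self):
            super().__init__()
            self.anchorlist = []

        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                for name, value in attrs:
                    if name == 'href' and value:
                        self.anchorlist.append(value)

    def parse_and_get_links(filename, base_url):
        parser = LinkParser()
        with open(filename, encoding='utf-8', errors='replace') as f:
            parser.feed(f.read())
        parser.close()
        # resolve relative links against the page URL, as getPage() does
        return [urljoin(base_url, link) for link in parser.anchorlist]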