Five Ways to Write a Simple Web Crawler in Python

Method 1 for fetching HTML: urllib


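# -*- coding: UTF-8 -*-
# Python 2 example: urllib.urlopen and the print statement are Python 2 only
import urllib


def getWebPageContent(url):
    """Fetch the content of a web page and return it."""
    f = urllib.urlopen(url)
    try:
        data = f.read()
    finally:
        # Close the connection even if read() fails
        f.close()
    return data


url = 'http://blog.csdn.net'
content = getWebPageContent(url)
print content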
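The listing above depends on Python 2 names (`urllib.urlopen`, the `print` statement). As a minimal sketch only, assuming Python 3, which the original article does not cover, the same helper might look like the following. The function name `get_web_page_content` and the `timeout` parameter are illustrative choices, not part of the original.

```python
from urllib.request import urlopen


def get_web_page_content(url, timeout=10):
    """Fetch a URL and return the raw response body as bytes."""
    # The context manager closes the response even if read() raises.
    with urlopen(url, timeout=timeout) as response:
        return response.read()
```

Calling `get_web_page_content('http://blog.csdn.net')` would return the page as `bytes`; decoding it to text (for example with `.decode('utf-8')`) depends on the encoding the site actually serves.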


Method 2 for fetching HTML: Pycurl

# -*- coding: UTF-8 -*-
# Pycurl reference: http://pycurl.sourceforge.net/
# Pycurl download: http://pycurl.sourceforge.net/download/pycurl-7.18.1.tar.gz
import pycurl
import StringIO


def getURLContent_pycurl(url):
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    # Collect the response body in an in-memory buffer
    b = StringIO.StringIO()
    c.setopt(pycurl.WRITEFUNCTION, b.write)
    # Follow redirects, at most 5 of them
    c.setopt(pycurl.FOLLOWLOCATION, 1)
    c.setopt(pycurl.MAXREDIRS, 5)
    # Proxy (optional)
    #c.setopt(pycurl.PROXY, 'http://11.11.11.11:8080')
    #c.setopt(pycurl.PROXYUSERPWD, 'aaa:aaa')
    c.perform()
    # Release the curl handle once the transfer is done
    c.close()
    return b.getvalue()


url = 'http://blog.csdn.net'
content = getURLContent_pycurl(url)
print content
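The Pycurl listing configures two things beyond the plain download: a redirect limit and an optional proxy. For readers without Pycurl installed, here is a rough standard-library counterpart, assuming Python 3 and `urllib.request` in place of Pycurl. It is a swapped-in technique, not Pycurl code. The names `LimitedRedirectHandler`, `make_opener` and `get_url_content` are hypothetical, and `max_redirections` only roughly corresponds to `MAXREDIRS`.

```python
import urllib.request


class LimitedRedirectHandler(urllib.request.HTTPRedirectHandler):
    """Redirect handler that gives up after about 5 redirects."""
    max_redirections = 5


def make_opener(proxy_url=None):
    """Build an opener with a redirect limit and an optional HTTP(S) proxy."""
    handlers = [LimitedRedirectHandler()]
    if proxy_url is not None:
        # Route both http and https requests through the given proxy.
        handlers.append(
            urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
        )
    return urllib.request.build_opener(*handlers)


def get_url_content(url, proxy_url=None, timeout=10):
    """Fetch a URL through the configured opener and return bytes."""
    opener = make_opener(proxy_url)
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Passing `proxy_url='http://11.11.11.11:8080'` would mirror the commented-out proxy line in the listing above; that address is the placeholder from the original, not a working proxy.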


Method 3 for fetching HTML: cPAMIE

cPAMIE download: http://sourceforge.net/project/showfiles.php?group_id=103662

# -*- coding: UTF-8 -*-
# cPAMIE automates Internet Explorer, so this example runs on Windows only
import cPAMIE


def getURLContent_cPAMIE(url):
    g_ie = cPAMIE.PAMIE()
    g_ie.showDebugging = False
    g_ie.frameName = None
    # Load the page in the browser, then read its text
    g_ie.navigate(url)
    content = g_ie.pageGetText()
    g_ie.quit()
    return content


url = 'http://blog.csdn.net'
content = getURLContent_cPAMIE(url)
print content


Method 4 for fetching HTML: downloading to a file with urllib

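# -*- coding: UTF-8 -*-
import urllib

url = 'http://blog.csdn.net'
# Forward slashes work in Windows paths; the C:/temp directory must already exist
path = 'C:/temp/csdn.net.html'
# Download the page and save it to path
urllib.urlretrieve(url, path)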
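In Python 3 the download helper lives in `urllib.request`. The following is a small sketch under that assumption; `download_to_file` is a hypothetical wrapper name, not something from the original article.

```python
from urllib.request import urlretrieve


def download_to_file(url, path):
    """Download url, save the response body at path, and return the path."""
    # urlretrieve returns a (filename, headers) tuple; only the filename is kept.
    saved_path, _headers = urlretrieve(url, path)
    return saved_path
```

For example, `download_to_file('http://blog.csdn.net', 'C:/temp/csdn.net.html')` would write the page to the same location as the listing above, provided the target directory exists.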


Method 5 for fetching HTML: client.getPage from the Twisted framework

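# -*- coding: UTF-8 -*-
# Twisted framework download:
# http://tmrc.mit.edu/mirror/twisted/Twisted/8.1/Twisted_NoDocs-8.1.0.win32-py2.5.exe
from twisted.internet import reactor
from twisted.web import client


def result(content):
    # Called with the page body once the download finishes
    print content
    reactor.stop()


def failed(failure):
    # Stop the reactor on errors too, otherwise reactor.run() never returns
    print failure.getErrorMessage()
    reactor.stop()


deferred = client.getPage("http://blog.csdn.net")
deferred.addCallback(result)
deferred.addErrback(failed)
reactor.run()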
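In the Twisted listing, `client.getPage` hands back a deferred and the callback fires once the download completes, so the program is not blocked while waiting. As a loose analogue only, and not Twisted code, the sketch below uses the Python 3 standard library (`asyncio` plus worker threads) to fetch several URLs at once. The names `fetch` and `fetch_all` are hypothetical, and Python 3.7 or later is assumed.

```python
import asyncio
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Blocking fetch; returns the response body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


async def fetch_all(urls):
    """Fetch all URLs concurrently in worker threads, keeping input order."""
    loop = asyncio.get_running_loop()
    # Each blocking fetch runs in the default thread pool executor.
    tasks = [loop.run_in_executor(None, fetch, url) for url in urls]
    return await asyncio.gather(*tasks)
```

A call such as `asyncio.run(fetch_all(['http://blog.csdn.net']))` would return a list with one `bytes` object per URL, in the same order as the input.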
