2018/7/21 Python Web Crawler Study Notes


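2018/7/21: some Python crawler practice code I put together over the past few days. The examples use urllib2, so they target Python 2.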

import urllib2

# Open the URL and read the whole response body
response = urllib2.urlopen("http://baidu.com")
html = response.read()

print html
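The snippet above runs only on Python 2, because urllib2 was split into urllib.request and urllib.error in Python 3. As a minimal sketch of the same fetch on Python 3, the helper below reads the body as bytes and closes the connection when it is done. The name fetch and the 10-second timeout are assumptions, not part of the original notes.

```python
import urllib.request


def fetch(url, timeout=10):
    """Return the raw response body of url as bytes.

    The name fetch and the default timeout are illustrative choices.
    """
    # The response object is a context manager, so the connection
    # is closed even if read() raises.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()
```

Calling fetch("http://baidu.com") would return the page as bytes. Decode it, for example with .decode("utf-8", errors="replace"), before treating it as text.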

Going a step further, build a Request object first and pass it to urlopen:

import urllib2

# Wrap the URL in a Request object, then open it
req = urllib2.Request("http://www.baidu.com")
response = urllib2.urlopen(req)
html = response.read()

print html

Posing as a browser

import urllib2

url = "http://www.baidu.com"
# User-Agent string of Internet Explorer 9
user_agent = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
headers = {"User-Agent": user_agent}
req = urllib2.Request(url, headers=headers)
response = urllib2.urlopen(req)
the_page = response.read()
print the_page
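To see what the headers argument does without sending anything over the network, the following Python 3 sketch builds the request and inspects it. The helper name build_request is hypothetical. urllib.request.Request is the Python 3 counterpart of urllib2.Request. Note that Request normalizes header names by capitalizing only the first letter, so the stored key is "User-agent".

```python
import urllib.request

USER_AGENT = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"


def build_request(url, user_agent=USER_AGENT):
    """Build (but do not send) a request carrying a browser User-Agent.

    build_request is a hypothetical helper used only for illustration.
    """
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request("http://www.baidu.com")
# Request stores header names via str.capitalize(), hence "User-agent"
print(req.get_header("User-agent"))
print(req.full_url)
```

Passing req to urllib.request.urlopen would then send the request with that header attached.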

Code: take a URL and a page range as input and print each page

# -*- coding: utf-8 -*-
import urllib2


def load_page(url):
    """Fetch url with a browser User-Agent and return the response body."""
    user_agent = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
    headers = {"User-Agent": user_agent}
    req = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(req)
    html = response.read()
    return html


def tieba_spider(url, begin_page, end_page):
    """
    Tieba crawler: fetch and print pages begin_page through end_page.
    """
    for i in range(begin_page, end_page + 1):
        # The offset appended to the URL advances by 50 per page
        pn = 50 * (i - 1)
        my_url = url + str(pn)
        html = load_page(my_url)
        print "################## Page %d ########################" % (i)
        print html
        print "###############################################"


if __name__ == "__main__":
    url = raw_input("Enter the Tieba URL: ")
    begin_page = int(raw_input("Enter the first page number: "))
    end_page = int(raw_input("Enter the last page number: "))

    tieba_spider(url, begin_page, end_page)
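The loop derives each page's address by appending an offset of 50 * (i - 1) to the URL the user typed, so the entered URL has to end where the offset belongs (presumably with something like "pn="). A small Python 3 sketch makes the mapping from page numbers to URLs explicit. The function name page_urls, the page_size parameter, and the example base URL are assumptions for illustration.

```python
def page_urls(base_url, begin_page, end_page, page_size=50):
    """List the URLs for pages begin_page..end_page (inclusive).

    page_size=50 mirrors the step used by tieba_spider.
    """
    return [base_url + str(page_size * (page - 1))
            for page in range(begin_page, end_page + 1)]


# Hypothetical base URL that ends in the offset parameter
for page_url in page_urls("http://tieba.baidu.com/f?kw=python&pn=", 1, 3):
    print(page_url)
```

For pages 1 to 3 the appended offsets are 0, 50 and 100. Building the list first also makes it easy to check the URLs before any request is sent.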

Code: take a URL and a page range as input and save each page to a file

# -*- coding: utf-8 -*-
import urllib2


def load_page(url):
    """Fetch url with a browser User-Agent and return the response body."""
    user_agent = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"
    headers = {"User-Agent": user_agent}
    req = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(req)
    html = response.read()
    return html


def write_to_file(file_name, txt):
    """Write the text txt to the file file_name."""
    print "Saving file " + file_name
    f = open(file_name, 'w')
    f.write(txt)
    f.close()


def tieba_spider(url, begin_page, end_page):
    """
    Tieba crawler: fetch pages begin_page through end_page and save each one.
    """
    for i in range(begin_page, end_page + 1):
        # The offset appended to the URL advances by 50 per page
        pn = 50 * (i - 1)
        my_url = url + str(pn)
        html = load_page(my_url)

        # Page i is saved as "<i>.html" in the current directory
        file_name = str(i) + ".html"
        write_to_file(file_name, html)


if __name__ == "__main__":
    url = raw_input("Enter the Tieba URL: ")
    begin_page = int(raw_input("Enter the first page number: "))
    end_page = int(raw_input("Enter the last page number: "))

    tieba_spider(url, begin_page, end_page)
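The file-writing helper above opens, writes and closes the file by hand, so the file is left open if the write fails. One possible variant, sketched here for Python 3, uses a with block and binary mode, since response.read() returns bytes there. The name save_page and the out_dir parameter are assumptions, not part of the original script.

```python
import os


def save_page(page_number, content, out_dir="."):
    """Write content (bytes) to <out_dir>/<page_number>.html and return the path.

    save_page and out_dir are hypothetical names used for illustration.
    """
    file_name = os.path.join(out_dir, str(page_number) + ".html")
    # "wb" because the downloaded body is bytes; the with block
    # closes the file even if write() raises.
    with open(file_name, "wb") as f:
        f.write(content)
    return file_name
```

Returning the path lets the caller log or verify where each page was written.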
