Python web scraping with requests + selenium + BeautifulSoup


Preface:

  • Environment: Windows 64-bit, Python 3.4
  • Basic usage of the requests library:

1. Installation: pip install requests

2. What it does: requests sends network requests, letting you issue the same kinds of HTTP requests a browser would in order to fetch data from a website.

3. Common operations:

import requests  # import the requests module

r = requests.get("https://api.github.com/events")  # fetch a page
# Set a timeout: stop waiting for a response after the given number of seconds
r2 = requests.get("https://api.github.com/events", timeout=0.001)
payload = {'key1': 'value1', 'key2': 'value2'}
r1 = requests.get("http://httpbin.org/get", params=payload)
print(r.url)          # print the URL
print(r.text)         # read the response body as text
print(r.encoding)     # get the current encoding
print(r.content)      # response body as bytes
print(r.status_code)  # response status code
print(r.status_code == requests.codes.ok)  # use the built-in status code lookup
print(r.headers)      # response headers, presented as a Python dict
print(r.headers['content-type'])  # header access is case-insensitive
print(r.history)      # a list of Response objects (redirect history)
print(type(r))        # type of the Response object
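The `params` encoding shown above can also be inspected without sending anything over the network: requests lets you build a `Request` object and call `prepare()` on it. A minimal sketch (the httpbin.org URL is reused from the snippet above purely for illustration):

```python
import requests

# Build a GET request without sending it, to see how the params dict
# is percent-encoded into the final query string.
payload = {"key1": "value1", "key2": "value2"}
req = requests.Request("GET", "http://httpbin.org/get", params=payload)
prepared = req.prepare()
print(prepared.url)  # http://httpbin.org/get?key1=value1&key2=value2
```

This is also a convenient way to debug a request (headers, body, URL) before actually issuing it with a `Session`.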
  • Basic usage of the BeautifulSoup4 library:

1. Installation: pip install beautifulsoup4

2. What it does: Beautiful Soup is a Python library for extracting data from HTML and XML files.

3. Common operations:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

ss = BeautifulSoup(html_doc, "html.parser")
print(ss.prettify())         # pretty-print with standard indentation
print(ss.title)              # <title>The Dormouse's story</title>
print(ss.title.name)         # title
print(ss.title.string)       # The Dormouse's story
print(ss.title.parent.name)  # head
print(ss.p)                  # <p class="title"><b>The Dormouse's story</b></p>
print(ss.p['class'])         # ['title']
print(ss.a)                  # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
print(ss.find_all("a"))      # list of all <a> tags
print(ss.find(id="link3"))   # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

for link in ss.find_all("a"):
    print(link.get("href"))  # get the link of every <a> tag in the document

print(ss.get_text())         # get all text content from the document
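Besides find() and find_all(), BeautifulSoup also supports CSS selectors through select(). A small sketch using a trimmed-down version of the document above:

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">Once upon a time there were three little sisters;
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")
# select() takes a CSS selector string and returns a list of matching
# tags in document order.
sisters = soup.select("p.story a.sister")
print([a["id"] for a in sisters])  # ['link1', 'link2']
```

CSS selectors are often more compact than nested find() calls when the target tag is identified by a chain of classes or ids.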
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')  # create a BeautifulSoup object
find = soup.find('p')  # find the first <p> tag
print("find's return type is ", type(find))            # type of the return value
print("find's content is", find)                       # value returned by find
print("find's Tag Name is ", find.name)                # name of the tag
print("find's Attribute(class) is ", find['class'])    # class attribute of the tag

print(find.string)  # text content of the tag

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup1 = BeautifulSoup(markup, "html.parser")
comment = soup1.b.string
print(type(comment))  # the content of a comment
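To tell a comment apart from ordinary text programmatically, the object returned by .string can be checked against bs4's Comment class. A short sketch:

```python
from bs4 import BeautifulSoup, Comment

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, "html.parser")
text = soup.b.string
# When a tag's only child is a comment, .string returns a Comment object,
# a str subclass, so it can be filtered out with isinstance().
print(isinstance(text, Comment))  # True
print(text)  # Hey, buddy. Want to buy a used parser?
```

This matters in scrapers: get_text() style extraction can otherwise pick up commented-out markup as if it were visible page text.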
  • A first try:
import requests
import io
import sys

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')  # change the default encoding of standard output

r = requests.get('https://unsplash.com')  # send a GET request to the target URL; returns a Response object

print(r.text)  # r.text is the HTML of the HTTP response
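The stdout re-wrapping trick above can be reproduced offline: io.TextIOWrapper will wrap any byte stream, not just sys.stdout.buffer. A minimal sketch using an in-memory buffer in place of stdout:

```python
import io

# Wrap a byte buffer in a TextIOWrapper that encodes written text as
# gb18030, mirroring what the sys.stdout assignment does for console output.
buf = io.BytesIO()
out = io.TextIOWrapper(buf, encoding="gb18030")
out.write("爬蟲")  # write Chinese text
out.flush()        # push the encoded bytes into the underlying buffer
print(buf.getvalue())                      # the gb18030-encoded bytes
print(buf.getvalue().decode("gb18030"))    # 爬蟲
```

GB18030 covers the full Unicode range, which is why it is a safe choice when a Windows console would otherwise fail to display scraped text.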

 


