Python Learning Path (5), Crawlers (4): Scraping a Quotes Site with Regular Expressions

Last updated: 2018-03-28 Source: Internet

Uploaded by: User


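The four main steps of a crawler

Define the target (know which scope or website you intend to search)
Crawl (download all of the site's content)
Extract (discard the data that is of no use to us)
Process the data (store and use it the way we want)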
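
The four steps can be sketched as a small pipeline. This is only an illustration: the function names (fetch, extract, store), the sample HTML, and the output format are assumptions made for this sketch, not something the article prescribes.

```python
import re
from urllib.request import urlopen

# Step 1 (define the target): the site this article scrapes later on.
TARGET_URL = "http://quotes.toscrape.com"


def fetch(url):
    """Step 2 (crawl): download the whole page as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def extract(html):
    """Step 3 (extract): keep only the parts we care about, here <h1> titles."""
    titles = re.findall(r"<h1[^>]*>(.*?)</h1>", html, re.DOTALL)
    return [title.strip() for title in titles]


def store(items):
    """Step 4 (process): put the data into the form we want, one item per line."""
    return "\n".join(items)


# Hypothetical sample page so the sketch runs without network access;
# in real use this would be fetch(TARGET_URL).
sample_html = "<html><body><h1> Quotes to Scrape </h1><p>ignored</p></body></html>"
report = store(extract(sample_html))
print(report)
```

Keeping the download step separate from the extraction step makes it possible to test the extraction on a saved page without hitting the site again.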

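What is a regular expression

A regular expression, often shortened to regex, is typically used to search for and replace text that conforms to a certain pattern (rule).

A regular expression is a logical formula for operating on strings: predefined special characters, and combinations of them, form a "rule string" that expresses a filtering logic to apply to strings.

Given a regular expression and another string, we can do the following:

Determine whether the given string conforms to the regular expression's filtering logic ("matching");

Extract the specific parts we want from a text string by means of the regular expression ("filtering").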
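
As a minimal sketch of these two uses, the example below checks whether a string matches a pattern and then extracts every matching part from a longer text. The date pattern and the sample strings are assumptions chosen for illustration.

```python
import re

pattern = r"\d{4}-\d{2}-\d{2}"  # a date shaped like 2018-03-28

# Matching: does the whole string conform to the pattern?
is_date = re.fullmatch(pattern, "2018-03-28") is not None
not_date = re.fullmatch(pattern, "28 March 2018") is not None

# Filtering: pull every matching part out of a longer text.
text = "Last updated: 2018-03-28, first published: 2018-03-01"
dates = re.findall(pattern, text)

print(is_date, not_date, dates)
```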

Regular expression matching rules
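
A few commonly used rules are sketched below. This is not an exhaustive list, and the sample strings are assumptions chosen for illustration. The difference between greedy and lazy repetition matters most when several HTML elements sit on the same line.

```python
import re

html = ('<a class="tag" href="/tag/life/">life</a> '
        '<a class="tag" href="/tag/love/">love</a>')

# .    any character except a newline
# *    zero or more repetitions, greedy: takes as much as possible
# *?   zero or more repetitions, lazy: takes as little as possible
# ()   a capture group; findall returns what the group matched
greedy = re.findall(r'<a class="tag" href=".*">(.*)</a>', html)
lazy = re.findall(r'<a class="tag" href=".*?">(.*?)</a>', html)

# \d   a digit; + one or more repetitions; [a-z] one character in the range
digits = re.findall(r"\d+", "page 2 of 10")
words = re.findall(r"[a-z]+", "tag-cloud")

# ^ and $ anchor the pattern to the start and end of the string
anchored = re.search(r"^page \d+$", "page 2") is not None

print(greedy, lazy, digits, words, anchored)
```

The greedy pattern swallows both links in one match and reports only the last tag, while the lazy pattern returns one result per link.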

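Python's re module

In Python, we can use the built-in re module to work with regular expressions.

One point needs special attention: regular expressions use the backslash (\) to escape special characters, and ordinary Python string literals use the backslash for their own escapes too. To pass the backslashes through to the regular expression unchanged, use a raw string, which only requires adding the r prefix. For example:

r'chuanzhiboke\t\.\tpython'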
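
A minimal sketch of what the r prefix changes, using the pattern above. The test strings are assumptions chosen for illustration.

```python
import re

# Without the r prefix, Python itself turns \t into a single tab character.
plain = "\t"
# With the r prefix, the string keeps two characters: a backslash and a "t".
raw = r"\t"

pattern = r'chuanzhiboke\t\.\tpython'
# Inside the pattern, \t matches a tab and \. matches a literal dot.
matches_dot = re.fullmatch(pattern, "chuanzhiboke\t.\tpython") is not None
matches_other = re.fullmatch(pattern, "chuanzhiboke\tX\tpython") is not None

print(len(plain), len(raw), matches_dot, matches_other)
```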

Scraping quotes from the quotes site with regular expressions, taking only the 10 entries on the home page

from urllib.request import urlopen
import re


def spider_quotes():
    url = "http://quotes.toscrape.com"
    response = urlopen(url)
    html = response.read().decode("utf-8")

    # Get the 10 quotes
    quotes = re.findall(r'<span class="text" itemprop="text">(.*)</span>', html)
    # strip() works inward from both ends and removes every character that
    # appears in its argument, here the curly quotation marks around each quote
    list_quotes = [quote.strip("“”") for quote in quotes]

    # Get the authors of the 10 quotes
    list_authors = re.findall(r'<small class="author" itemprop="author">(.*)</small>', html)

    # Get the tags of the 10 quotes. Each tags <div> spans several lines,
    # so re.DOTALL is needed to let "." match newlines as well
    tag_blocks = re.findall(r'<div class="tags">(.*?)</div>', html, re.DOTALL)
    list_tags = []
    for block in tag_blocks:
        # Non-greedy groups keep each match inside a single <a> element
        names = re.findall(r'<a class="tag" href=".*?">(.*?)</a>', block)
        list_tags.append(",".join(names))

    # Combine the results: one tab-separated line per quote
    for quote, author, tags in zip(list_quotes, list_authors, list_tags):
        print("\t".join([quote, author, tags]))


# Call the function
spider_quotes()
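
The tags pattern in the scraper depends on the re.DOTALL flag. The sketch below shows why, using a simplified fragment in the shape of one quote block. The fragment, quote text, author, and tag names are hypothetical, so the example runs without network access.

```python
import re

# Hypothetical, simplified fragment in the shape of one quote block.
sample = '''<div class="quote">
<span class="text" itemprop="text">“A sample quote.”</span>
<small class="author" itemprop="author">Some Author</small>
<div class="tags">
    <a class="tag" href="/tag/first/">first</a>
    <a class="tag" href="/tag/second/">second</a>
</div>
</div>'''

# Without re.DOTALL, "." stops at a newline, so the multi-line div is missed.
without_flag = re.findall(r'<div class="tags">(.*?)</div>', sample)
with_flag = re.findall(r'<div class="tags">(.*?)</div>', sample, re.DOTALL)

tags = re.findall(r'<a class="tag" href=".*?">(.*?)</a>', with_flag[0])
quote = re.findall(r'<span class="text" itemprop="text">(.*?)</span>', sample)[0]
author = re.findall(r'<small class="author" itemprop="author">(.*?)</small>', sample)[0]

row = "\t".join([quote.strip("“”"), author, ",".join(tags)])
print(row)
```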

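The BeautifulSoup4 parser

BeautifulSoup makes parsing HTML fairly simple. Its API is very user-friendly, and it supports CSS selectors, the HTML parser in the Python standard library, and lxml's XML parser.

Official documentation: http://beautifulsoup.readthedocs.io/zh_CN/v4.4.0

Fetching the quotes site's home-page data with BeautifulSoup4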

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://quotes.toscrape.com"
response = urlopen(url)

# Create a BeautifulSoup instance. The second argument names the parser used
# for the response; the most common choice is the built-in html.parser
bs = BeautifulSoup(response, "html.parser")

# Get the 10 quotes
spans = bs.select("span.text")
list_quotes = []
for span in spans:
    span_text = span.text
    list_quotes.append(span_text.strip("“”"))

# Get the authors of the 10 quotes
authors = bs.select("small.author")
list_authors = []
for author in authors:
    author_text = author.text
    list_authors.append(author_text)

# Get the tags of the 10 quotes
divs = bs.select("div.tags")
list_tags = []
for div in divs:
    tag_links = div.select("a.tag")
    tag_list = [tag_a.text for tag_a in tag_links]
    list_tags.append(",".join(tag_list))

# Combine the results: one tab-separated line per quote
results = []
for i in range(len(list_quotes)):
    results.append("\t".join([list_quotes[i], list_authors[i], list_tags[i]]))

for result in results:
    print(result)


This article was originally written in Chinese and published on aliyun.com.
