Python Development: Web Crawler Basics

Last updated: 2018-08-11. Source: Internet

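Introduction to Crawlers

Crawler: think of the Internet as one large web and a crawler as a spider moving across it. If you want a resource that sits somewhere on that web, the crawler can fetch it for you.

Put simply, a crawler is an automated program that requests websites and extracts data from them.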

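The basic crawler workflow:

Send the request: use an HTTP library to send a request to the target site. The request can carry extra information such as headers. Then wait for the server to respond.
Get the response content: if the server responds normally, you get a response whose body is the page content you asked for. It may be HTML, a JSON string, binary data, and so on.
Parse the content: HTML can be parsed with regular expressions or an HTML parsing library. JSON can be converted directly into a JSON object. Binary data can be saved or processed further.
Store the data: the data can be stored in many forms, for example as text, in a database, or as a file in a specific format.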
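The four steps above can be sketched end to end as follows. This is a minimal illustration that uses only the standard library, not the article's own code: the function names `fetch`, `parse_links`, and `store`, the `example-crawler/0.1` User-Agent string, and the choice to extract `<a>` links are all assumptions made for the example.

```python
import json
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch(url):
    """Steps 1 and 2: send the request and return the response body as text."""
    request = Request(url, headers={"User-Agent": "example-crawler/0.1"})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class LinkParser(HTMLParser):
    """Step 3: collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def parse_links(html):
    """Parse an HTML string and return the list of link targets."""
    parser = LinkParser()
    parser.feed(html)
    return parser.links


def store(links, path):
    """Step 4: store the extracted data as a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(links, f, ensure_ascii=False, indent=2)
```

Against a live site, the pieces would be chained as `store(parse_links(fetch(url)), "links.json")`. Keeping each step in its own function makes parsing and storage testable without network access.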

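The Request and Response process:

(1) The browser sends a message to the server that hosts the requested URL. This step is called the HTTP Request.

(2) The server receives the message, processes it according to its content, and sends a message back to the browser. This step is called the HTTP Response.

(3) The browser receives the server's response, processes it, and displays the result.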

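The request:

Request method: mainly GET and POST. Others include HEAD, PUT, DELETE, and OPTIONS.
Request URL: URL stands for Uniform Resource Locator. A web page, an image, or a video can each be uniquely identified by a URL.
Request headers: header information sent with the request, such as User-Agent, Host, and Cookie.
Request body: extra data carried by the request, such as the form data in a form submission.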
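To make the four parts of a request concrete, the sketch below builds one without sending it. It uses the standard library's `urllib.request.Request` rather than the `requests` package so that it runs offline. The form fields and the User-Agent value are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request

form = {"username": "user", "password": "pwd"}  # placeholder form data

request = Request(
    url="http://www.baidu.com",                      # request URL
    data=urlencode(form).encode("utf-8"),            # request body
    headers={"User-Agent": "example-crawler/0.1"},   # request headers
    method="POST",                                   # request method
)

# Nothing has been sent yet; these only inspect the request object.
print(request.get_method())
print(request.full_url)
print(request.header_items())
print(request.data)
```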

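The response:

Response status: there are many status codes, such as 200 (success), 301 (permanent redirect), 404 (page not found), and 502 (bad gateway, a server-side error).
Response headers: for example content type, content length, server information, and Set-Cookie.
Response body: the most important part. It holds the content of the requested resource, such as the page HTML, an image, or other binary data.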
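A crawler usually branches on the status code before it touches the body. The helper below is a small sketch of that idea using the standard library's `http.HTTPStatus`. The function name `describe_status` and the category labels are assumptions for illustration.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the standard reason phrase and a coarse category for a status code."""
    phrase = HTTPStatus(code).phrase
    if 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirect"
    elif 400 <= code < 500:
        category = "client error"
    elif code >= 500:
        category = "server error"
    else:
        category = "informational"
    return phrase, category


for code in (200, 301, 404, 502):
    print(code, *describe_status(code))
```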

A simple example:

    import requests

    response = requests.get('http://www.baidu.com')
    print(response.text)         # response body
    print(response.headers)      # response headers
    print(response.status_code)  # status code

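What kinds of data can be crawled?

Web text, such as HTML documents and JSON-formatted text.
Images: the fetched content is binary data, which is saved in an image format.
Video: also binary data, saved in a video format.
Anything else: whatever can be requested can be fetched.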

Ways to process the data:

Direct processing
JSON parsing
Regular expressions
BeautifulSoup
PyQuery
XPath
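Two of the options listed above, JSON parsing and regular expressions, need nothing beyond the standard library. The sketch below applies them to made-up sample strings. The JSON fields and the HTML snippet are invented for illustration.

```python
import json
import re

# JSON parsing: convert a JSON string into Python objects.
json_text = '{"title": "news", "tags": ["car", "review"]}'
data = json.loads(json_text)
print(data["title"], data["tags"])

# Regular expressions: pull every href value out of an HTML fragment.
html = '<a href="/news/1.html">first</a> <a href="/news/2.html">second</a>'
hrefs = re.findall(r'href="([^"]+)"', html)
print(hrefs)
```

Regular expressions are fine for small, predictable fragments like this one. For whole pages, a parser such as BeautifulSoup is usually more robust.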

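How to store the data

Text: plain text, JSON, XML, and so on.
Relational databases, such as MySQL and Oracle.
Non-relational databases, such as MongoDB and Redis, which store documents or key-value pairs.
Binary files: images, video, and audio can be saved directly in the appropriate format.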
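The storage options above can be sketched with the standard library alone. Here `sqlite3` stands in for a relational database such as MySQL or Oracle, which is an assumption made so that the example needs no server. The function names and the `article` table are hypothetical.

```python
import json
import sqlite3


def save_json(records, path):
    """Text storage: write the records to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False)


def save_binary(content, path):
    """Binary storage: write raw bytes (image, video, audio) to a file."""
    with open(path, "wb") as f:
        f.write(content)


def save_rows(records, db_path):
    """Relational storage: insert the records and return the table's row count."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS article (title TEXT, url TEXT)"
            )
            conn.executemany(
                "INSERT INTO article (title, url) VALUES (:title, :url)",
                records,
            )
        return conn.execute("SELECT COUNT(*) FROM article").fetchone()[0]
    finally:
        conn.close()
```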

A small example:

Crawl the href of each a tag and the images on https://www.autohome.com.cn/news/, and save the images locally.

    import uuid

    import requests
    from bs4 import BeautifulSoup

    response = requests.get(url='https://www.autohome.com.cn/news/')
    response.encoding = response.apparent_encoding  # avoid garbled text
    soup = BeautifulSoup(response.text, features='html.parser')
    target = soup.find(id='auto-channel-lazyload-article')
    li_list = target.find_all('li')
    for i in li_list:  # each i is itself a soup object, so find() works on it
        a = i.find('a')  # a.attrs raises an error if no <a> was found, so check first
        if a:
            h3 = a.find('h3')
            img_tag = a.find('img')
            if not (h3 and img_tag):  # skip entries without a title or an image
                continue
            a_href = 'http:' + a.attrs.get('href')
            print(a_href)
            print(h3.text)
            img = 'http:' + img_tag.attrs.get('src')
            print(img)
            img_response = requests.get(url=img)
            filename = str(uuid.uuid4()) + '.jpg'  # random unique file name
            with open(filename, 'wb') as f:
                f.write(img_response.content)
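The example above prepends 'http:' to each link, which only works when every href and src is protocol-relative (starting with `//`). A more general option, sketched here as an alternative to the original code, is `urllib.parse.urljoin`, which resolves any kind of link against the page URL. The helper name `absolute_url` and the sample paths are hypothetical.

```python
from urllib.parse import urljoin

PAGE_URL = "https://www.autohome.com.cn/news/"


def absolute_url(link, page_url=PAGE_URL):
    """Resolve a link found on the page into an absolute URL."""
    return urljoin(page_url, link)


print(absolute_url("//www.autohome.com.cn/news/a.html"))  # protocol-relative
print(absolute_url("/img/b.jpg"))                         # root-relative
print(absolute_url("c.html"))                             # page-relative
print(absolute_url("http://example.com/d.jpg"))           # already absolute
```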

A quick summary:

    # requests
    response = requests.get('url')
    response.text      # body decoded as text
    response.content   # body as raw bytes
    response.encoding
    response.encoding = response.apparent_encoding
    response.status_code

    # BeautifulSoup
    soup = BeautifulSoup(response.text, features='html.parser')
    v1 = soup.find('div')  # first matching tag
    soup.find(id='i1')
    soup.find('div', id='i1')
    v2 = soup.find_all('div')
    obj = v1
    obj = v2[0]  # pick an object out of the list by index
    obj.text
    obj.attrs  # attributes

Introduction to the requests module

1. How the methods relate

    requests.get()
    requests.post()
    requests.put()
    requests.delete()
    ...

All of these methods are thin wrappers that call requests.request(). For example:

    def get(url, params=None, **kwargs):
        r"""Sends a GET request.

        :param url: URL for the new :class:`Request` object.
        :param params: (optional) Dictionary or bytes to be sent in the query string for the :class:`Request`.
        :param \*\*kwargs: Optional arguments that ``request`` takes.
        :return: :class:`Response <Response>` object
        :rtype: requests.Response
        """
        kwargs.setdefault('allow_redirects', True)
        return request('get', url, params=params, **kwargs)

2. Common parameters of requests.request():

- method: the HTTP method to use
- url: the URL to request
- params: parameters passed in the URL query string, typically with GET. For example:

    requests.request(
        method='GET',
        url='http://www.baidu.com',
        params={'username': 'user', 'password': 'pwd'}
    )
    # http://www.baidu.com?username=user&password=pwd

- data: data passed in the request body

    requests.request(
        method='POST',
        url='http://www.baidu.com',
        data={'username': 'user', 'password': 'pwd'}
    )

- json: data passed in the request body, serialized as JSON

    requests.request(
        method='POST',
        url='http://www.baidu.com',
        json={'username': 'user', 'password': 'pwd'}
    )
    # the whole dict is sent as one JSON string in the body

- headers: request headers

    requests.request(
        method='POST',
        url='http://www.baidu.com',
        json={'username': 'user', 'password': 'pwd'},
        headers={
            'referer': 'https://dig.chouti.com/',
            'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'
        }
    )

- cookies: cookies, which are normally sent to the server in the headers.
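The difference between params, data, and json comes down to where the values go and how they are encoded. The sketch below reproduces the three encodings with the standard library so that it runs offline. It illustrates the wire formats involved and is not how requests is implemented. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode

payload = {"username": "user", "password": "pwd"}


def as_params(url, params):
    """params: the values are encoded into the URL query string."""
    return url + "?" + urlencode(params)


def as_form_body(data):
    """data: a dict becomes a form-encoded request body."""
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return headers, urlencode(data).encode("utf-8")


def as_json_body(obj):
    """json: the object is serialized to a JSON string and sent as the body."""
    headers = {"Content-Type": "application/json"}
    return headers, json.dumps(obj).encode("utf-8")


print(as_params("http://www.baidu.com", payload))
print(as_form_body(payload))
print(as_json_body(payload))
```

The Content-Type header is what tells the server how to decode the body, which is why the form and JSON variants set different values.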
