Scraping "Goodies" from 囧網 with Python (multithreading, urllib, request)

Last updated: 2018-09-04 Source: Internet

Uploaded by: User



import os
import re
import threading  # multithreading
import urllib.request
from urllib.error import URLError  # exception raised when a request fails


class QsSpider:
    """Fetch the site's listing pages and download the images they reference."""

    def __init__(self):
        self.user_agent = ('Mozilla/5.0 (Windows NT 10.0; WOW64) '
                           'AppleWebKit/537.36 (KHTML, like Gecko) '
                           'Chrome/63.0.3239.84 Safari/537.36')
        self.header = {'User-Agent': self.user_agent}
        self.url = 'http://www.qiubaichengren.net/%s.html'
        self.save_dir = './img'
        self.page_num = 20  # pages 1 .. page_num - 1 are crawled

    def load_html(self, page):
        """Fetch the HTML source of one page and pass it to pick_pic."""
        try:
            web_path = self.url % page
            request = urllib.request.Request(web_path, headers=self.header)
            with urllib.request.urlopen(request) as f:
                html_content = f.read().decode('gbk')
                self.pick_pic(html_content)
        except URLError as e:
            print(e.reason)  # reason for the failure

    def save_pic(self, img):
        """Download one image into save_dir, using a file name derived from its URL."""
        save_path = self.save_dir + '/' + img.replace(':', '@').replace('/', '_')
        # exist_ok avoids an error when several threads create the directory at once
        os.makedirs(self.save_dir, exist_ok=True)
        print(save_path)
        urllib.request.urlretrieve(img, save_path)

    def pick_pic(self, html_content):
        """Filter the image URLs (jpg, png, gif) out of the HTML and save each one."""
        pattern = re.compile(r'src="(http:.*?\.(?:jpg|png|gif))')
        for img in pattern.findall(html_content):
            self.save_pic(img)

    def start(self):
        """Start one thread per page."""
        for i in range(1, self.page_num):
            # args must be a tuple; passing str(i) would split '10' into '1' and '0'
            thread = threading.Thread(target=self.load_html, args=(i,))
            thread.start()


if __name__ == '__main__':
    spider = QsSpider()
    spider.start()
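The script above starts one thread per page. As an alternative, the number of concurrent threads can be capped with a pool. The following is only a minimal sketch using the standard library's ThreadPoolExecutor; fetch_page is a hypothetical stand-in for load_html that returns the URL it would request instead of touching the network, and max_workers=4 is an arbitrary assumption.

```python
from concurrent.futures import ThreadPoolExecutor

URL_TEMPLATE = 'http://www.qiubaichengren.net/%s.html'


def fetch_page(page):
    """Hypothetical stand-in for load_html: build the page URL, no network access."""
    return URL_TEMPLATE % page


def crawl_pages(pages, max_workers=4):
    """Run fetch_page for every page with at most max_workers threads at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns the results in the same order as the input pages
        return list(pool.map(fetch_page, pages))


urls = crawl_pages(range(1, 20))
print(len(urls), urls[0])
```

Leaving the with block waits for all submitted work to finish, so no explicit join calls are needed.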

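I. Crawler workflow:

1. Send a request

Use an HTTP library to send a request (a Request) to the target site.

A Request contains request headers, a request body, and so on.

Limitation of the request module: it only downloads the raw response and cannot execute JS or CSS code, so content that a page generates with JavaScript will not appear in what is downloaded.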
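As a rough illustration of what a Request carries, the sketch below builds urllib.request.Request objects without sending them. build_request is a hypothetical helper, and the POST body b'key=value' is made up purely to show that supplying data changes the method.

```python
import urllib.request

USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36')


def build_request(url, data=None):
    """Hypothetical helper: attach a browser-like User-Agent header to a request."""
    headers = {'User-Agent': USER_AGENT}
    return urllib.request.Request(url, data=data, headers=headers)


# Nothing is sent here; the objects are only inspected.
get_req = build_request('http://www.qiubaichengren.net/1.html')
post_req = build_request('http://www.qiubaichengren.net/1.html', data=b'key=value')

print(get_req.get_method())   # GET, because no request body was supplied
print(post_req.get_method())  # POST, because a request body was supplied
print(get_req.get_header('User-agent'))  # urllib stores header names capitalized
```

Passing such an object to urllib.request.urlopen is what actually sends the request.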

2. Get the response content

If the server responds normally, you get a Response.

A Response may contain HTML, JSON, images, video, and so on.

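3. Parse the content

Parsing HTML data: regular expressions (the re module), or third-party parsing libraries such as Beautiful Soup and pyquery.

Parsing JSON data: the json module.

Handling binary data: write it to a file opened in wb mode.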
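The three cases above can be sketched with the standard library alone. The sample HTML, the sample JSON, and the helper names are assumptions made for illustration; the regular expression is the one used in the script.

```python
import json
import os
import re
import tempfile

IMG_PATTERN = re.compile(r'src="(http:.*?\.(?:jpg|png|gif))')


def extract_image_urls(html):
    """HTML: pull image URLs out of the page source with a regular expression."""
    return IMG_PATTERN.findall(html)


def parse_json(text):
    """JSON: convert the response text into Python objects."""
    return json.loads(text)


def save_binary(data, path):
    """Binary data: write the raw bytes to a file opened in 'wb' mode."""
    with open(path, 'wb') as f:
        f.write(data)


sample_html = ('<img src="http://example.com/a.jpg">'
               '<img src="http://example.com/b.png" alt="">')
sample_json = '{"page": 1, "images": ["a.jpg", "b.png"]}'

image_urls = extract_image_urls(sample_html)
parsed = parse_json(sample_json)

# Write a few made-up bytes into a temporary directory and read them back.
with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, 'sample.bin')
    save_binary(b'\x89PNG fake bytes', target)
    with open(target, 'rb') as f:
        round_trip = f.read()

print(image_urls)
print(parsed['images'])
```

A regular expression is enough for a fixed pattern like this one; for less regular markup, one of the parsing libraries mentioned above is usually the more robust choice.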

4. Store the data

Databases (MySQL, MongoDB, Redis)

Files
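As a sketch of the database option, the example below records which image URL was saved to which path. It uses the standard library's sqlite3 as a stand-in, not MySQL, MongoDB, or Redis, so that it runs without a server; the table name, columns, and sample rows are all assumptions.

```python
import sqlite3


def save_records(conn, records):
    """Store (url, path) pairs; a URL that is already present is skipped."""
    conn.execute(
        'CREATE TABLE IF NOT EXISTS images (url TEXT PRIMARY KEY, path TEXT)')
    conn.executemany(
        'INSERT OR IGNORE INTO images (url, path) VALUES (?, ?)', records)
    conn.commit()


conn = sqlite3.connect(':memory:')  # in-memory database, for illustration only
save_records(conn, [
    ('http://example.com/a.jpg', './img/a.jpg'),
    ('http://example.com/b.png', './img/b.png'),
    ('http://example.com/a.jpg', './img/a.jpg'),  # duplicate, ignored
])
row_count = conn.execute('SELECT COUNT(*) FROM images').fetchone()[0]
print(row_count)
```

Using the URL as the primary key is one simple way to avoid recording the same image twice when several pages reference it.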

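II. The Response

1. Response status codes

    200: success (OK)

    301: redirect (Moved Permanently)

    404: the requested resource does not exist (Not Found)

    403: access is not permitted (Forbidden)

    502: server-side error (Bad Gateway)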
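The codes listed above can be looked up with the standard library's http.HTTPStatus. The following is a minimal sketch; describe_status and its category labels are hypothetical names chosen for this example.

```python
from http import HTTPStatus


def describe_status(code):
    """Return (category, reason phrase) for an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = 'Unknown'  # not a status code that HTTPStatus knows about
    if 200 <= code < 300:
        category = 'success'
    elif 300 <= code < 400:
        category = 'redirect'
    elif 400 <= code < 500:
        category = 'client error'
    elif 500 <= code < 600:
        category = 'server error'
    else:
        category = 'other'
    return category, phrase


for code in (200, 301, 404, 403, 502):
    print(code, describe_status(code))
```

With urllib, urlopen follows redirects automatically and raises urllib.error.HTTPError for 4xx and 5xx responses. HTTPError is a subclass of URLError and exposes the status as its code attribute.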

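III. HTTP requests and responses

Request: the user sends their information to the server (socket server) through the browser (socket client).

Response: the server receives the request, analyzes the request information sent by the user, and returns data (which may contain links to other resources, such as images, JS, and CSS).

PS: After receiving the Response, a browser parses its content and displays it to the user. A crawler simulates a browser to send the request and receive the Response, and then extracts the useful data from it.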
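The socket client and socket server roles can be tried out entirely on the local machine. The sketch below starts a throwaway http.server instance on a free loopback port and sends it a hand-written GET request over a plain socket. DemoHandler, fetch_raw, demo, and the HTML body are all assumptions made for illustration.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Server side: answer every GET with a small HTML page."""

    def do_GET(self):
        body = b'<html><body><img src="/a.jpg"></body></html>'
        self.send_response(200)
        self.send_header('Content-Type', 'text/html')
        self.send_header('Content-Length', str(len(body)))
        self.send_header('Connection', 'close')
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress the default per-request log line


def fetch_raw(host, port, path='/'):
    """Client side: send a raw HTTP request and read until the server closes."""
    request = ('GET %s HTTP/1.1\r\nHost: %s\r\nConnection: close\r\n\r\n'
               % (path, host)).encode('ascii')
    chunks = []
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(request)
        while True:
            chunk = sock.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return b''.join(chunks)


def demo():
    """Run one request/response round trip; return (status line, body)."""
    server = HTTPServer(('127.0.0.1', 0), DemoHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        raw = fetch_raw('127.0.0.1', server.server_address[1])
    finally:
        server.shutdown()
        server.server_close()
    head, _, body = raw.partition(b'\r\n\r\n')  # headers end at the blank line
    return head.split(b'\r\n')[0].decode('ascii'), body


status_line, body = demo()
print(status_line)
print(body)
```

The returned body still contains a link to another resource (the img tag), which is exactly what a crawler would extract and request next.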

IV. Results (the goodies)


This article was originally written in Chinese and published on aliyun.com.
