Python Crawler Study Notes: A Single-Threaded Crawler


Last updated: 2017-01-18. Source: Internet.



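Introduction

This article shows how to crawl the course information on Maizi Academy (麥子學院, maiziedu.com); the crawler is still single-threaded. Before getting started, take a look at a screenshot of the result.

Eager to try it yourself? First open the Maizi Academy website and find the full course list, as shown below.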

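Now page through the list and watch how the URL changes. The first page is http://www.maiziedu.com/course/list/, the second page is http://www.maiziedu.com/course/list/all-all/0-2/, and the third page is http://www.maiziedu.com/course/list/all-all/0-3/. Each page turn increases the number after "0-" by 1. What about the first page, then? Entering http://www.maiziedu.com/course/list/all-all/0-1/ in the browser's address bar does open the first page, so the rest is easy: re.sub() can rewrite the page number to produce the URL of any page.

With the page URLs in hand, the next step is to get the page source. Right-click the page and choose Inspect (or Inspect Element), and you will see a view like the following.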
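The URL pattern described above can be sketched as follows. This is only a minimal illustration that assumes the "/0-N/" suffix observed in the address bar; the helper name build_page_urls is hypothetical.

```python
import re


def build_page_urls(url, total_page):
    """Return the URLs from the page number found in `url` up to `total_page`."""
    # Read the current page number from the '/0-N/' part of the URL.
    start = int(re.search(r'/0-(\d+)/', url).group(1))
    # Swap in each page number in turn to get one URL per page.
    return [re.sub(r'/0-\d+/', '/0-%d/' % i, url)
            for i in range(start, total_page + 1)]


urls = build_page_urls('http://www.maiziedu.com/course/list/all-all/0-1/', 3)
for u in urls:
    print(u)
```

Note that the fourth positional argument of re.sub() is count, not flags, so a flag such as re.S has to be passed as flags=re.S when it is needed.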

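Once you have located the courses in the page source, regular expressions make it easy to pull out the content we need. How exactly to extract it is up to you: you learn far more by working out the patterns yourself. If you really cannot figure it out, read on and look at my source code.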
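As a rough illustration of the idea, the sketch below runs the same kind of non-greedy patterns against a small HTML fragment. The fragment is made up for this example (the course titles and counts are placeholders) and the real page markup may differ.

```python
import re

# Hypothetical HTML fragment shaped like a course list; not copied from the site.
sample = '''
<ul class="zy_course_list">
  <li>
    <a title="Python Basics" href="/course/1/"></a>
    <p class="color99">1200 students</p>
  </li>
  <li>
    <a title="Web Crawlers" href="/course/2/"></a>
    <p class="color99">800 students</p>
  </li>
</ul>
'''

# re.S lets '.' match newlines; '.*?' stops at the first closing tag it meets.
course_list = re.search(r'<ul class="zy_course_list">(.*?)</ul>', sample, re.S).group(1)
items = re.findall(r'<li>(.*?)</li>', course_list, re.S)

courses = []
for item in items:
    title = re.search(r'<a title="(.*?)"', item, re.S).group(1)
    people = re.search(r'<p class="color99">(.*?)</p>', item, re.S).group(1)
    courses.append((title, people))

print(courses)
```

Narrowing the search first to the list block and then to each list item keeps the inner patterns short and avoids matching unrelated parts of the page.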

Hands-On Source Code

# coding=utf-8
# Written for Python 3.
import re
import requests


class spider():
    def __init__(self):
        print("Start crawling...")

    def changePage(self, url, total_page):
        # Build the list of page URLs from the current page up to total_page.
        nowpage = int(re.search(r'/0-(\d+)/', url).group(1))
        pagegroup = []
        for i in range(nowpage, total_page + 1):
            link = re.sub(r'/0-(\d+)/', '/0-%s/' % i, url)
            pagegroup.append(link)
        return pagegroup

    def getsource(self, url):
        # Download a page and return its HTML source as text.
        html = requests.get(url)
        return html.text

    def getclasses(self, source):
        # Extract the <ul> block that holds the course list.
        classes = re.search(r'<ul class="zy_course_list">(.*?)</ul>', source, re.S).group(1)
        return classes

    def geteach(self, classes):
        # Split the course list into one chunk per <li> item.
        eachclasses = re.findall(r'<li>(.*?)</li>', classes, re.S)
        return eachclasses

    def getinfo(self, eachclass):
        # Pull the course title and the enrollment text out of one item.
        info = {}
        info['title'] = re.search(r'<a title="(.*?)"', eachclass, re.S).group(1)
        info['people'] = re.search(r'<p class="color99">(.*?)</p>', eachclass, re.S).group(1)
        return info

    def saveinfo(self, classinfo):
        # Append every course to info.txt; the file is closed automatically.
        with open('info.txt', 'a', encoding='utf-8') as f:
            for each in classinfo:
                f.write('title : ' + each['title'] + '\n')
                f.write('people : ' + each['people'] + '\n\n')


if __name__ == '__main__':
    classinfo = []
    url = 'http://www.maiziedu.com/course/list/all-all/0-1/'
    maizispider = spider()
    all_links = maizispider.changePage(url, 30)
    for link in all_links:
        htmlsources = maizispider.getsource(link)
        classes = maizispider.getclasses(htmlsources)
        eachclasses = maizispider.geteach(classes)
        for each in eachclasses:
            info = maizispider.getinfo(each)
            classinfo.append(info)
    maizispider.saveinfo(classinfo)
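One thing to keep in mind when reading the source code: re.search() returns None when nothing matches, so calling .group(1) directly raises AttributeError if an item lacks the expected tag. A small defensive wrapper could look like the sketch below; the name first_group and the sample string are hypothetical and not part of the crawler above.

```python
import re


def first_group(pattern, text, default=''):
    """Return group(1) of the first match, or `default` when nothing matches."""
    match = re.search(pattern, text, re.S)
    return match.group(1) if match else default


# Made-up item that has a title but no <p class="color99"> element.
item = '<a title="Python Basics" href="/course/1/"></a>'

title = first_group(r'<a title="(.*?)"', item)
people = first_group(r'<p class="color99">(.*?)</p>', item)
print(title)
print(repr(people))
```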

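The code above is not hard to follow; it is mostly regular-expression work. Run it as is and you will get the content shown in the screenshot at the start of the article. Because this is a single-threaded crawler, it feels a little slow; a multi-threaded crawler will follow in a later update.

At readers' request, here are the installation steps for the requests library and a simple example.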

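First install the pip package manager by downloading get-pip.py. My machine has both python2 and python3 installed.

To install pip for python2:

python get-pip.py

To install pip for python3:

python3 get-pip.py

Once pip is installed, install the requests library to start learning to write crawlers in Python.

Installing requests

pip3 install requests

I use python3; with python2 you can simply run pip install requests.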
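When both python2 and python3 are installed, it is easy to install a package for one interpreter and run the other. A quick check along the following lines, run with the interpreter you intend to use, shows whether it can see the package. This is only a sketch; the helper name is_installed is hypothetical.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the current interpreter can locate the top-level module."""
    return importlib.util.find_spec(module_name) is not None


# 'json' ships with Python, so this is True on any standard installation.
has_json = is_installed('json')
print('json available:', has_json)
print('requests available:', is_installed('requests'))
```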

Getting-Started Example

import requests
html = requests.get("http://gupowang.baijia.baidu.com/article/283878")
html.encoding = 'utf-8'
print(html.text)

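The first line imports the requests library, the second line uses the get method of requests to fetch the page source, the third line sets the encoding, and the fourth line prints the text.

To save the fetched page source to a text file: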
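The encoding step matters because the text you get back is the response bytes decoded with whatever charset is assumed. As far as I understand the requests documentation, when a text response carries no charset in its headers the library falls back to ISO-8859-1, which garbles Chinese pages. The sketch below reproduces the effect with plain bytes and no network access; the sample string is arbitrary.

```python
# Bytes as a server might send them: UTF-8 encoded Chinese text.
raw = '爬蟲'.encode('utf-8')

# Decoding with the wrong charset yields mojibake instead of the original text.
wrong = raw.decode('iso-8859-1')

# Decoding with the right charset recovers the text.
right = raw.decode('utf-8')

print(repr(wrong))
print(right)
```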

import requests
html = requests.get("http://gupowang.baijia.baidu.com/article/283878")
html.encoding = 'utf-8'
# Write the file as UTF-8; the with block closes it automatically.
with open("news.txt", "w", encoding="utf-8") as html_file:
    print(html.text, file=html_file)

This article was originally written in Chinese and published on aliyun.com.
