Scraping Images from Basic Web Pages with Python

Last updated: 2018-04-08  Source: Internet

Uploaded by: User


A summary of basic Python crawlers

1. How scraping works

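A crawler behaves much like a browser client: it sends a request to the website's server, and that request is usually a URL, that is, a web address. The server then responds with an HTML page (other data types are possible too), and that response is the page content. Our job is to parse the response, pick out the parts we want, and write them to local storage in the required form.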
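To make the request and response round trip concrete, here is a minimal sketch that uses only the standard library. It starts a throwaway local server and fetches one page from it, so it runs without touching a real website. The handler class, the page content and the `round_trip` function are all made up for illustration.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"<html><body><h1>hello crawler</h1></body></html>"  # made-up page


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical server side: answers every GET with the same HTML page."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def round_trip():
    """Start a local server, request one page, return what came back."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    # Ignore any system proxy settings, since the target is on this machine.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        url = "http://127.0.0.1:%d/" % server.server_address[1]
        with opener.open(url, timeout=5) as resp:
            return resp.status, resp.headers["Content-Type"], resp.read()
    finally:
        server.shutdown()
        server.server_close()


status, content_type, body = round_trip()
print(status, content_type)
print(body)
```

The client side is the part a crawler actually needs: a URL goes out, and a status code, headers and a body of bytes come back.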

2. Basic crawler workflow
1. Fetching the page's response

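There are two common ways to do this:

    import requests

    def get_html(url):
        html = requests.get(url)
        return html.text

or

    import urllib.request

    def get_html(url):
        html = urllib.request.urlopen(url)
        return html.read()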

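The first approach, the get method, returns a Response object that holds everything the server sent back, including the response headers and the status code. Printing html directly only shows <Response [200]>. To get at the body there are two attributes, content and text: content returns bytes, while text returns a Unicode string (for a beginner crawler like this one either works; I am still studying the encoding side -_-). Here we simply return .text.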
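The difference between bytes and a decoded string matters as soon as a page is not UTF-8. The sketch below does not call requests at all. `text_from_content` is a hypothetical stand-in that roughly imitates how a text string is derived from content bytes, and the UTF-8 fallback is an assumption of this sketch (requests itself tries to guess the encoding when none is declared).

```python
def text_from_content(content, encoding=None):
    """Roughly imitate how a `text` string is derived from `content` bytes.

    `encoding` stands in for the charset the server declared. Falling back
    to UTF-8 is an assumption made for this sketch.
    """
    return content.decode(encoding or "utf-8", errors="replace")


content = "站酷 ZCOOL".encode("gbk")          # what arrives on the wire: bytes
right = text_from_content(content, "gbk")     # decoded with the matching charset
wrong = text_from_content(content, "utf-8")   # decoded with the wrong charset

print(type(content).__name__, type(right).__name__)
print(right)
print(right == wrong)
```

When the page looks garbled after using text, working from content and decoding it yourself with the right charset is usually the way out.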
For the second approach, I will quote a passage found online:

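urlopen opens a URL. The url argument can be either a URL string or a Request object, and the return value is an http.client.HTTPResponse object. An http.client.HTTPResponse object offers, roughly, the methods read(), readinto(), getheader(), getheaders() and fileno(), plus the attributes msg, version, status, reason, debuglevel and closed. In practice read() usually has to be followed by decode(). One big advantage here is that the returned page content has not been decoded yet, so after read() you can choose the decoding by passing the appropriate argument to decode().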
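The read() then decode() pattern from the quote can be sketched as follows. A data: URL stands in for a real page so that the example needs no network access; the HTML string is made up, and falling back to UTF-8 when no charset is declared is an assumption of this sketch.

```python
import urllib.parse
import urllib.request

# A data: URL stands in for a real page so the sketch runs offline.
html_source = "<html><body><p>站酷</p></body></html>"
url = "data:text/html;charset=utf-8," + urllib.parse.quote(html_source)

with urllib.request.urlopen(url) as resp:
    raw = resp.read()                                  # undecoded bytes
    # Use the charset declared in the headers; fall back if there is none.
    charset = resp.headers.get_content_charset() or "utf-8"

html = raw.decode(charset)
print(type(raw).__name__, "->", type(html).__name__)
print(html)
```

With a real site the same three steps apply: read the bytes, find out the charset, then decode.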

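2. Parsing the page content

Regular expressions are a good option, but I am not very good at them. Fortunately a powerful third-party library, BeautifulSoup, has been a big help.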

    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    urls = soup.find_all('div', attrs={'class': 'bets-name'})
    print(urls[0])

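BeautifulSoup offers many methods. First create a soup instance, here with the built-in html.parser parser (lxml and others can be used instead). Then pass arguments that describe the target tag in order to find it. Note what find_all returns: a list-like ResultSet of Tag objects, not a single tag.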
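Since find_all hands back a collection, the usual next step is to loop over it and pull attributes or text out of each tag. Here is a small sketch on made-up markup; only the class name is taken from the snippet above, the rest is invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; only the class name comes from the article.
html = """
<div class="bets-name"><a href="/a">first</a></div>
<div class="bets-name"><a href="/b">second</a></div>
<div class="other">ignored</div>
"""

soup = BeautifulSoup(html, "html.parser")
matches = soup.find_all("div", attrs={"class": "bets-name"})

print(len(matches))                    # find_all returns a list-like ResultSet
for tag in matches:                    # each item is a Tag
    link = tag.find("a")               # find returns the first match or None
    print(link["href"], link.get_text())
```

Indexing (`matches[0]`) works as it does on a list, but it raises IndexError when nothing matched, so checking the length first is safer on real pages.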

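3. Downloading the information locally

Text can be written to a file directly. For images, you have to request the image link again and then write the response's content attribute (the raw bytes) to a file.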
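The two cases differ only in how the file is opened: text mode with an explicit encoding for strings, binary mode for bytes. A minimal sketch, with hypothetical helper names and placeholder bytes instead of a real image:

```python
import os
import tempfile


def save_text(path, text):
    """Write a str to disk, stating the encoding explicitly."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def save_bytes(path, data):
    """Write raw bytes (for example an image body) unchanged."""
    with open(path, "wb") as f:
        f.write(data)


folder = tempfile.mkdtemp()                      # throwaway directory for the demo
text_path = os.path.join(folder, "title.txt")
image_path = os.path.join(folder, "1.jpg")

save_text(text_path, "站酷")
# Placeholder bytes, not a real JPEG; a crawler would pass the response body here.
save_bytes(image_path, b"\xff\xd8\xff\xe0 fake image body")

print(os.path.getsize(text_path), os.path.getsize(image_path))
```

Opening an image file in text mode, or writing a decoded string where bytes are expected, is a common cause of corrupted downloads.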

3. Scraping images from ZCOOL

PyCharm is used as the development tool here!

    # coding: utf-8
    # date: 2018/04/04
    # target: Pictures on ZHANK
    from bs4 import BeautifulSoup
    import requests
    import urllib.request


    def get_html(url):
        html = requests.get(url)
        return html.text


    def Download(html, filepath):
        soup = BeautifulSoup(html, 'html.parser')
        urls = soup.find_all('div', class_="imgItem maskWraper")
        count = 1
        try:
            for url in urls:
                img = url.find('img')
                print(img)
                img_url = img['data-original']
                req = requests.get(img_url)
                # write the file in binary mode
                with open(filepath + '/' + str(count) + '.jpg', 'wb') as f:
                    f.write(req.content)
                count += 1
                if count == 11:      # stop after ten images
                    break
        except Exception as e:
            print(e)


    def main():
        # target URL
        url = "http://www.hellorf.com/image/search/%E5%9F%8E%E5%B8%82/?utm_source=zcool_popular"
        # folder where the images are saved
        filepath = "D://案頭/Python/study_one/Spider_practice/Spider_File/icon"
        html = get_html(url)
        Download(html, filepath)


    if __name__ == "__main__":
        main()
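The script builds each file name by string concatenation and assumes the target folder already exists, so open() fails if it does not. One possible refinement, sketched here with a hypothetical `target_path` helper that is not part of the script above, is to create the folder on demand and take the file extension from the image URL. The example URLs are made up.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlparse


def target_path(folder, index, img_url, default_ext=".jpg"):
    """Return folder/<index><ext>, creating the folder if it is missing.

    Hypothetical helper. The extension is taken from the URL path when it
    has one; otherwise `default_ext` is used.
    """
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    ext = Path(urlparse(img_url).path).suffix or default_ext
    return folder / f"{index}{ext}"


base = Path(tempfile.mkdtemp()) / "icon"   # throwaway folder for the demo
p1 = target_path(base, 1, "http://example.com/img/city.png?size=large")
p2 = target_path(base, 2, "http://example.com/img/12345")

print(p1.name, p2.name)
```

Inside the download loop the result could be passed straight to open(..., 'wb') in place of the concatenated string.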


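This article was originally written in Chinese and published on aliyun.com. An English version is also provided for information purposes only. This website makes no express or implied representations or warranties about the accuracy, completeness or reliability of the article or of any translation of it. If you have any concerns or complaints about this article, please email info-contact@alibabacloud.com with a detailed description. Staff will contact you within 5 working days and, once the complaint is verified, the infringing content will be removed.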
