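No More Neihan Duanzi to Scroll Through: Scraping Images and Short Videos from the Duanyou Zhijia (段友之家) Tieba Forum with Python (Source Code Included)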

Last updated: 2018-06-05. Source: Internet


Tags: python, hobby, crawler, career

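Following the recent crackdown on video platforms, the Neihan Duanzi app was forced to shut down, leaving its many fans (known as "duanyou") without a home. Recently, though, I came across an app called "Duanyou" that is updated fairly often and is calling on fans to come back. If you are interested, you can download it and take a look (PS: this is not an ad, and I am not being paid for it).

A colleague also shared a Tieba forum where these fans gather. Here is the link:
Duanyou Zhijia (段友之家): https://tieba.baidu.com/f?ie=...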

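There really are a lot of fans posting there, so I decided to scrape their images and short videos, which is the topic of this article.

Scraping website data with Python is about as basic as it gets and is not difficult, but I still want to share it so we can learn and exchange ideas together.

The main modules used to scrape the data are bs4, requests, and os, all of which are commonly used.

The rough idea is to request the page HTML with the requests module, parse the response with BeautifulSoup from the bs4 module, and then use CSS selectors to find the addresses of the images and short videos. The main code is as follows: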

def download_file(web_url):
    """Find the resource URLs on a page and download them."""
    # Download the page
    print('Downloading page: %s...' % web_url)
    result = requests.get(web_url)
    soup = bs4.BeautifulSoup(result.text, "html.parser")
    # Look for image resources
    img_list = soup.select('.vpic_wrap img')
    if img_list == []:
        print('No image resources found!')
    else:
        # Resources found; start writing
        for img_info in img_list:
            file_url = img_info.get('bpic')
            if file_url:  # skip tags without a bpic attribute
                write_file(file_url, 1)
    # Look for video resources
    video_list = soup.select('.threadlist_video a')
    if video_list == []:
        print('No video resources found!')
    else:
        # Resources found; start writing
        for video_info in video_list:
            file_url = video_info.get('data-video')
            if file_url:  # skip links without a data-video attribute
                write_file(file_url, 2)
    print('Finished downloading resources:', web_url)
    # Follow the "next page" link, if there is one
    next_link = soup.select('#frs_list_pager .next')
    if next_link == []:
        print('All downloads finished!')
    else:
        url = next_link[0].get('href')
        download_file('https:' + url)
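The pagination step above prepends 'https:' to the next-page href, which only works when the href is protocol-relative (starting with '//'). As a minimal sketch of a more forgiving alternative, the standard library's urllib.parse.urljoin can resolve the href against the URL of the page it was found on. The helper name resolve_next_url and the example URLs are hypothetical and not part of the original script.

```python
from urllib.parse import urljoin


def resolve_next_url(current_url, href):
    """Resolve a next-page href against the page it was found on.

    Hypothetical helper, not part of the original script.
    Returns None when there is no href to follow.
    """
    if not href:
        return None
    # urljoin handles protocol-relative ('//host/path'), root-relative
    # ('/path') and absolute hrefs alike.
    return urljoin(current_url, href)


if __name__ == '__main__':
    base = 'https://tieba.baidu.com/f?kw=example'
    print(resolve_next_url(base, '//tieba.baidu.com/f?kw=example&pn=50'))
    print(resolve_next_url(base, '/f?kw=example&pn=100'))
```

With a helper like this, the recursive call could also be replaced by a loop that keeps fetching pages until the helper returns None, which avoids deep recursion on forums with many pages.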

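Getting the image and video addresses is not enough on its own; the resources still have to be written to local disk. This is done by reading each remote file as binary data and writing it into a folder chosen by file type. The main code is as follows: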

def write_file(file_url, file_type):
    """Download a resource and write it to a local file."""
    res = requests.get(file_url)
    res.raise_for_status()
    # Choose the target folder by file type
    if file_type == 1:
        file_folder = os.path.join('nhdz', 'jpg')
    elif file_type == 2:
        file_folder = os.path.join('nhdz', 'mp4')
    else:
        file_folder = os.path.join('nhdz', 'other')
    # Create the folder if it does not exist yet
    os.makedirs(file_folder, exist_ok=True)
    # Derive the file name from the URL, dropping any query string
    file_name = os.path.basename(file_url)
    str_index = file_name.find('?')
    if str_index > 0:
        file_name = file_name[:str_index]
    file_path = os.path.join(file_folder, file_name)
    print('Writing resource file:', file_path)
    # Write the response body to disk in binary chunks
    with open(file_path, 'wb') as image_file:
        for chunk in res.iter_content(100000):
            image_file.write(chunk)
    print('Write complete!')
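The file-name step above takes the basename of the whole URL and then cuts it at the first '?'. If the query string itself contains a '/', the basename is taken from inside the query instead. A minimal sketch of one way around this, assuming it is acceptable to parse the URL first: the helper name file_name_from_url, the fallback name 'unnamed', and the example URL are all hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def file_name_from_url(file_url, default='unnamed'):
    """Return the last path segment of a URL, ignoring query and fragment.

    Hypothetical helper, not part of the original script.
    Falls back to `default` when the path ends with a slash.
    """
    path = urlsplit(file_url).path      # e.g. '/a/b/pic.jpg'
    name = posixpath.basename(path)     # URL paths always use '/'
    return name or default


if __name__ == '__main__':
    print(file_name_from_url('https://example.com/a/b/pic.jpg?src=/x/y'))
```

The result can be passed to os.path.join together with the target folder, exactly as file_name is in write_file.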

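Finally, here is the complete code. Otherwise people would complain that I only told half the story and promised a treat without delivering all of it, which would not be fair. Here it is: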

#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Scrape images and videos from the Duanyou Zhijia (段友之家) forum on Baidu Tieba
author: cuizy
time: 2018-05-19
"""
import os

import bs4
import requests


def write_file(file_url, file_type):
    """Download a resource and write it to a local file."""
    res = requests.get(file_url)
    res.raise_for_status()
    # Choose the target folder by file type
    if file_type == 1:
        file_folder = os.path.join('nhdz', 'jpg')
    elif file_type == 2:
        file_folder = os.path.join('nhdz', 'mp4')
    else:
        file_folder = os.path.join('nhdz', 'other')
    # Create the folder if it does not exist yet
    os.makedirs(file_folder, exist_ok=True)
    # Derive the file name from the URL, dropping any query string
    file_name = os.path.basename(file_url)
    str_index = file_name.find('?')
    if str_index > 0:
        file_name = file_name[:str_index]
    file_path = os.path.join(file_folder, file_name)
    print('Writing resource file:', file_path)
    # Write the response body to disk in binary chunks
    with open(file_path, 'wb') as image_file:
        for chunk in res.iter_content(100000):
            image_file.write(chunk)
    print('Write complete!')


def download_file(web_url):
    """Find the resource URLs on a page and download them."""
    # Download the page
    print('Downloading page: %s...' % web_url)
    result = requests.get(web_url)
    soup = bs4.BeautifulSoup(result.text, "html.parser")
    # Look for image resources
    img_list = soup.select('.vpic_wrap img')
    if img_list == []:
        print('No image resources found!')
    else:
        # Resources found; start writing
        for img_info in img_list:
            file_url = img_info.get('bpic')
            if file_url:  # skip tags without a bpic attribute
                write_file(file_url, 1)
    # Look for video resources
    video_list = soup.select('.threadlist_video a')
    if video_list == []:
        print('No video resources found!')
    else:
        # Resources found; start writing
        for video_info in video_list:
            file_url = video_info.get('data-video')
            if file_url:  # skip links without a data-video attribute
                write_file(file_url, 2)
    print('Finished downloading resources:', web_url)
    # Follow the "next page" link, if there is one
    next_link = soup.select('#frs_list_pager .next')
    if next_link == []:
        print('All downloads finished!')
    else:
        url = next_link[0].get('href')
        download_file('https:' + url)


# Main entry point
if __name__ == '__main__':
    web_url = 'https://tieba.baidu.com/f?ie=utf-8&kw=段友之家'
    download_file(web_url)
