Python Web Scraping from Scratch (Scraping the Latest Movie Rankings)

Last updated: 2018-02-26. Source: Internet


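Note: these are my notes from working through an article by Ehco and implementing it myself.

Target site

http://dianying.2345.com/top/

Site structure

The part to scrape sits under a ul tag (including its li tags). Roughly speaking, it is enough to iterate over the li tags and output their contents.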
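The iteration described above can be sketched on a small inline sample. The markup below is a hypothetical, simplified stand-in for the real page (only the class names `picList clearfix` and `sTit` come from the full code later in this article), and `list_titles` is a name made up for this illustration. It uses Python's built-in `html.parser` so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the ranking page.
SAMPLE_HTML = """
<ul class="picList clearfix">
  <li><span class="sTit"><a href="#">Movie A</a></span></li>
  <li><span class="sTit"><a href="#">Movie B</a></span></li>
</ul>
"""


def list_titles(html):
    """Return the title text of every li inside the movie list ul."""
    soup = BeautifulSoup(html, "html.parser")
    movie_list = soup.find("ul", class_="picList clearfix")
    return [
        li.find("span", class_="sTit").a.text
        for li in movie_list.find_all("li")
    ]


print(list_titles(SAMPLE_HTML))
```

The same pattern (find the ul, then loop over its li children) is what the full script does with the live page.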

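Problems encountered

The code is simple, but I ran into quite a few problems.

1. Encoding

I use gbk throughout: the response encoding is set to gbk before the page text is read.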
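Why the encoding matters can be shown without touching the network. The following is a minimal sketch using only the standard library; the sample string is an assumption chosen for illustration, not text taken from the site. It encodes a string as gbk and then decodes the same bytes with the right and the wrong codec.

```python
# Sample text (an assumption for illustration), encoded the way a gbk page would send it.
text = "最新电影排行"
raw = text.encode("gbk")

# Decoding with the matching codec recovers the original string.
decoded = raw.decode("gbk")

# Decoding the same bytes as UTF-8 yields replacement characters instead.
garbled = raw.decode("utf-8", errors="replace")

print(decoded == text)
print(garbled == text)
```

In requests, setting `r.encoding = 'gbk'` before reading `r.text` plays the role of the first decode call here.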

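2. Libraries

Along the way I found that requests, bs4, idna, certifi, chardet, urllib3 and other libraries were missing and had to be added by hand. Here is the method I used.

How to add a library:

Take urllib3 as an example.

Search Baidu for urllib3 and download it to your machine from the link.

I downloaded the first result.

Unzip it and drop the urllib3 folder into the Lib directory of your Python installation.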
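Before copying folders around, it can help to check which libraries are actually missing. This is a minimal sketch using only the standard library; `find_missing` and `REQUIRED` are names made up for this illustration, and the list simply repeats the libraries mentioned above. As an alternative to the manual copy, pip is the usual installer, and it normally pulls in the dependencies of requests as well.

```python
import importlib.util


def find_missing(modules):
    """Return the names in `modules` that cannot be imported in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


# Libraries mentioned in this article (bs4 is published on PyPI as beautifulsoup4).
REQUIRED = ["requests", "bs4", "idna", "certifi", "chardet", "urllib3"]

missing = find_missing(REQUIRED)
if missing:
    # Usual alternative to manual copying, run from a terminal:
    #   python -m pip install requests beautifulsoup4 lxml
    print("Missing:", ", ".join(missing))
else:
    print("All required libraries are available.")
```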

3. Downloading the image links

This one was interesting. At first I wrote it like this:

    f.write(requests.get(img_url).content)

It raised an error:

    File "C:\Users\Shinelon\AppData\Local\Programs\Python\Python36\lib\requests\models.py", line 379, in prepare_url
        raise MissingSchema(error)
    requests.exceptions.MissingSchema: Invalid URL '//imgwx5.2345.com/dypcimg/img/c/65/sup196183_223x310.jpg': No schema supplied. Perhaps you meant http:////imgwx5.2345.com/dypcimg/img/c/65/sup196183_223x310.jpg?

    Process finished with exit code 1

The image links on the page look like this: they start with // and carry no scheme, so the loop could not download them.

In the end I simply prefixed each link with http: myself:

    img_url2 = 'http:' + img_url
    f.write(requests.get(img_url2).content)
    print(img_url2)

After that it worked.
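A slightly more general alternative to prepending `http:` by hand is to resolve the link against the page URL with the standard library's `urljoin`, which fills in the missing scheme for links that start with `//`. This is a sketch rather than what the original script does; `absolute_url` and `PAGE_URL` are names introduced for this illustration.

```python
from urllib.parse import urljoin

# The page the image links were found on (the target site above).
PAGE_URL = "http://dianying.2345.com/top/"


def absolute_url(src, page_url=PAGE_URL):
    """Resolve an img src against the page URL so it always has a scheme."""
    return urljoin(page_url, src)


img_url = "//imgwx5.2345.com/dypcimg/img/c/65/sup196183_223x310.jpg"
print(absolute_url(img_url))
```

Links that are already absolute pass through unchanged, so the same helper can be applied to every src without checking its form first.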

Full code

    import requests
    import bs4


    def get_html(url):
        """Fetch a page and return its text decoded as gbk."""
        try:
            r = requests.get(url, timeout=30)
            r.raise_for_status()  # raises on 4xx/5xx responses
            r.encoding = 'gbk'
            return r.text
        except requests.RequestException:
            return "something wrong"


    def get_content(url):
        html = get_html(url)
        soup = bs4.BeautifulSoup(html, 'lxml')
        movieslist = soup.find('ul', class_='picList clearfix')
        if movieslist is None:
            # The fetch failed or the page layout changed
            print("movie list not found")
            return
        movies = movieslist.find_all('li')
        for top in movies:
            # Image src
            img_url = top.find('img')['src']
            # Movie name
            name = top.find('span', class_='sTit').a.text
            try:
                # Release date
                time = top.find('span', class_='sIntro').text
            except AttributeError:
                time = "No release date available"
            # Lead actors
            actors = top.find('p', class_='pActor')
            actor = ''
            for act in actors.contents:
                if act.string:  # skip nodes that have no single text string
                    actor = actor + act.string + ' '
            # Synopsis
            intro = top.find('p', class_='pTxt pIntroShow').text
            print("Title: {}\t{}\n{}\n{} \n \n ".format(name, time, actor, intro))
            # Download the image into the target directory
            img_url2 = 'http:' + img_url
            with open('/Users/Shinelon/Desktop/1212/' + name + '.png', 'wb') as f:
                f.write(requests.get(img_url2).content)
            print(img_url2)


    def main():
        url = 'http://dianying.2345.com/top/'
        get_content(url)


    if __name__ == "__main__":
        main()

