Basic Background Knowledge for Python Web Crawlers

Last updated: 2018-08-23 Source: Internet

Uploaded by: User


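1. Referencing global variables:

A global variable cannot simply be picked up and modified inside a function. Reading it works without any declaration, but to assign to it inside a function you must first declare it as global, for example global test for a global variable named test. Without the declaration, Python treats the name as local to the function, and an update such as test = test + 1 raises UnboundLocalError. For details: 17735159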
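
As a minimal sketch of the rule above (the function names read_only, increment and broken_increment are made up for illustration), the following shows that reading a global needs no declaration while rebinding it does:

```python
test = 0  # module-level (global) variable


def read_only():
    # Reading a global works without any declaration.
    return test


def increment():
    # Rebinding the global requires declaring it first.
    global test
    test += 1


def broken_increment():
    # No `global` declaration: Python treats `test` as a local name,
    # so this line raises UnboundLocalError when called.
    test += 1


increment()
print(read_only())  # prints 1
```

Calling broken_increment() fails because the assignment makes test local to that function, so it is read before it has a value.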
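2. Finding the original image on a web page:

Images displayed on a web page are usually compressed thumbnails, whereas what we want to crawl is the full-resolution original. In that case, right-click, view the page source, and look there; the link to the original image can usually be found. For example, on Baidu Images the link to the original is stored under a key named objURL, which Ctrl+F will locate. Other sites probably work in a similar way, so a careful search should turn it up.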
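
Once the key is known, a regular expression can pull every original link out of the page source. The sketch below assumes the source embeds entries of the form "objURL":"..."; the sample string, the thumbURL key and the example.com addresses are invented for illustration and are not taken from a real page.

```python
import re

# Hypothetical fragment of page source; a real page holds many such entries.
sample_source = (
    '{"thumbURL":"https://example.com/thumb/1.jpg",'
    '"objURL":"https://example.com/full/1.jpg","fromURL":"a"},'
    '{"thumbURL":"https://example.com/thumb/2.jpg",'
    '"objURL":"https://example.com/full/2.jpg","fromURL":"b"}'
)


def find_original_urls(source):
    """Return every value stored under the objURL key, in page order."""
    # The non-greedy (.*?) stops at the first closing quote after the key.
    return re.findall(r'"objURL":"(.*?)"', source)


for url in find_original_urls(sample_source):
    print(url)
```

For another site, the key name in the pattern would need to be replaced with whatever that site's source uses.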

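3. Getting the link to the next page:

Some page URLs are very long (Baidu Images URLs are notoriously long and messy), so the usual approach of spotting the pattern in the URL and building the next page's link from parameters clearly does not work. An alternative is to open the page source and search it for the text the page displays on its next-page control, such as "下一頁" ("Next page"). Taking Baidu Images as the example again: switch to paged mode with the toggle at the top right, then search the page source. The original post showed a screenshot at this point.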
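
The same search can be done in code. The sketch below assumes the next-page control is a plain <a> element whose visible text is the label; the sample markup, the helper name find_next_page and the query string are hypothetical, and a real page may be structured differently.

```python
import re
from html import unescape
from urllib.parse import urljoin

# Hypothetical pager markup, loosely modelled on a paged image search.
sample_source = (
    '<div id="page"><strong><span class="pc">1</span></strong>'
    '<a href="/search/flip?word=cat&amp;pn=20"><span class="pc">2</span></a>'
    '<a href="/search/flip?word=cat&amp;pn=20" class="n">下一页</a></div>'
)


def find_next_page(source, base_url, label):
    """Return the absolute URL of the link whose text is `label`, or None."""
    pattern = re.compile(
        r'<a[^>]*href="([^"]*)"[^>]*>\s*' + re.escape(label) + r'\s*</a>'
    )
    match = pattern.search(source)
    if match is None:
        return None  # last page, or the markup differs from the assumption
    # hrefs in HTML source may be entity-escaped and relative.
    return urljoin(base_url, unescape(match.group(1)))


print(find_next_page(sample_source, "https://image.baidu.com", "下一页"))
```

Returning None when no link is found lets the caller stop cleanly on the last page.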

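4. Finally, the most important point: decoding Chinese text in a web page:

After a successful request with the get function of the requests library, we want to save the page source, but the Chinese characters in the saved file come out garbled no matter how it is saved. Before saving, add this statement: r.encoding = r.apparent_encoding. r.apparent_encoding is the encoding that requests detects from the response body, so the statement sets the encoding used to decode the page to the detected one (as an online source puts it, it makes the page's encoding equal to its correct encoding). Then save with with open('file.txt', 'w', encoding='utf-8') as f: ... and the saved file will no longer contain garbled Chinese.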
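
To see why the statement helps without touching the network, the sketch below uses a hypothetical FakeResponse class that imitates only the three attributes involved (encoding, apparent_encoding, text); it is a stand-in for illustration, not part of requests. The helper name save_page is also made up.

```python
import os
import tempfile


class FakeResponse:
    """Hypothetical offline stand-in for a requests response."""

    def __init__(self, content, declared_encoding, detected_encoding):
        self.content = content                      # raw bytes from the server
        self.encoding = declared_encoding           # encoding assumed by default
        self.apparent_encoding = detected_encoding  # encoding found by detection

    @property
    def text(self):
        # Decode the raw bytes with whatever .encoding currently says.
        return self.content.decode(self.encoding, errors="replace")


def save_page(response, path):
    """Fix the decoding first, then write the text out as UTF-8."""
    response.encoding = response.apparent_encoding
    with open(path, "w", encoding="utf-8") as f:
        f.write(response.text)


# A GBK-encoded page that was wrongly assumed to be ISO-8859-1.
page = FakeResponse("中文网页".encode("gbk"), "ISO-8859-1", "gbk")
garbled = page.text  # decoded with the wrong codec, so it is mojibake

out_path = os.path.join(tempfile.mkdtemp(), "file.txt")
save_page(page, out_path)
with open(out_path, encoding="utf-8") as f:
    restored = f.read()

print(garbled != restored, restored == "中文网页")  # prints True True
```

With the real library, passing the object returned by requests.get to a helper like save_page should follow the same path, since that object exposes the same three attributes.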

Here is a sample script:

import os
import re
import time
from hashlib import md5
from urllib.parse import quote

import requests
from fake_useragent import UserAgent

url_list = []   # page URLs to visit; parsing a page appends the next page's URL
page_num = 1
# The original sent the literal string 'ua.random()'; use a real random User-Agent.
headers = {'User-Agent': UserAgent().random}


def get_one_page(url):
    """Fetch one result page and return its HTML, or None on failure."""
    global page_num
    try:
        r = requests.get(url=url, headers=headers, timeout=10)
    except requests.RequestException:
        return None
    if r.status_code != 200:
        return None
    print("page %s status_code = %s" % (page_num, r.status_code))
    page_num = page_num + 1
    return r.text


def get_image_list(html):
    """Return the original-image URLs on a page and queue the next page's URL."""
    # Original images are stored under the objURL key in the page source.
    image_list = re.findall(r'objURL":"(.*?)",', html, re.S)
    # The next-page link follows the highlighted current page number.
    pattern_next = re.compile(
        r'<strong><span class="pc"(.*?)<a href="(.*?)"><span class="pc" data="right"',
        re.S)
    match = pattern_next.search(html)
    if match:   # the original indexed the result directly and crashed on the last page
        url_list.append('https://image.baidu.com' + match.group(2))
    return image_list


def save_image(image_list):
    """Download each image into ./picture, named by the MD5 of its content."""
    os.makedirs('picture', exist_ok=True)
    for item in image_list:
        try:
            response = requests.get(url=item, headers=headers, timeout=10)
        except requests.RequestException:
            # Skip this image instead of abandoning the rest of the page.
            print("fail to download: " + item)
            continue
        file_path = '{0}/{1}.{2}'.format('picture', md5(response.content).hexdigest(), 'jpg')
        if not os.path.exists(file_path):
            with open(file_path, 'wb') as f:
                f.write(response.content)
            print("success download: " + file_path)
        else:
            print("already down: " + file_path)
        time.sleep(2)   # pause between downloads


if __name__ == '__main__':
    keyword = quote(input("Keyword to crawl: "))     # URL-encoded search term
    page = int(input("Number of pages to crawl: "))  # how many result pages
    url = 'https://image.baidu.com/search/flip?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&fm=result&fr=&sf=1&fmq=1535006333854_R&pv=&ic=0&nc=1&z=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&ctd=1535006333855%5E00_1903X943&word=' + keyword
    url_list.append(url)
    for each in range(page):
        if each >= len(url_list):
            break   # no next-page link was found on the previous page
        print(url_list[each])
        html = get_one_page(url_list[each])
        if html is None:
            break   # request failed; stop instead of parsing None
        save_image(get_image_list(html))
