python+selenium+requests: scraping the names of my blog followers


Scraping target

1. This code runs on Python 2; for Python 3 at most two lines need to change (they are marked in the reference code below). The following Python modules are used:

  • selenium 2.53.6 + Firefox 44
  • BeautifulSoup
  • requests

2. Target site, my blog: https://home.cnblogs.com/u/yoyoketang
What to scrape: the names of all of my blog's followers, saved to a txt file

3. Since logging in to cnblogs requires human verification (a captcha), you cannot log in directly with a username and password; selenium is used to obtain a logged-in session instead

Getting cookies with selenium

1. Precondition: first log in to my blog manually in the browser and have it remember the password
(so that after you close the browser, the next time you open my blog you are still logged in)
2. By default selenium starts the browser with an empty profile and does not load any cached profile data, so you first need to find the browser's profile directory; Firefox is used as the example here
3. Use driver.get_cookies() to fetch the browser's cookies

# coding:utf-8
import requests
from selenium import webdriver
from bs4 import BeautifulSoup
import re
import time

# firefox profile directory
profile_directory = r'C:\Users\admin\AppData\Roaming\Mozilla\Firefox\Profiles\yn80ouvt.default'

# load the profile
profile = webdriver.FirefoxProfile(profile_directory)
# start the browser with that profile
driver = webdriver.Firefox(profile)
driver.get("https://home.cnblogs.com/u/yoyoketang/followers/")
time.sleep(3)

cookies = driver.get_cookies()  # get the browser's cookies
print(cookies)
driver.quit()

(Note: if the blog page the script opens here is not logged in, there is no point reading any further — first check whether the profile path is wrong.)
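If you would rather not hard-code the profile path, Firefox keeps an index of its profiles in a profiles.ini file under the APPDATA directory on Windows. Below is a minimal sketch (not part of the original script) that reads that file to find a profile path; the exact layout of profiles.ini can vary between Firefox versions, so treat it as a starting point.

# A sketch for locating a Firefox profile automatically (assumes Windows and
# the standard profiles.ini layout; written for Python 3, where the module
# is called configparser rather than ConfigParser).
import os
import configparser

def find_firefox_profile():
    base = os.path.join(os.environ["APPDATA"], "Mozilla", "Firefox")
    ini = configparser.ConfigParser()
    ini.read(os.path.join(base, "profiles.ini"))
    for section in ini.sections():
        # profile sections are named [Profile0], [Profile1], ...
        if section.startswith("Profile"):
            path = ini.get(section, "Path", fallback=None)
            if path is None:
                continue
            # IsRelative=1 means the path is relative to the Firefox dir
            if ini.get(section, "IsRelative", fallback="1") == "1":
                path = os.path.join(base, path)
            return path  # returns the first profile found
    return None

print(find_firefox_profile())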

Adding the login cookies to requests

1. Once the browser's cookies have been fetched, use requests to create a session and add the logged-in cookies to it

s = requests.session()  # create a session

# add the cookies to a CookieJar
c = requests.cookies.RequestsCookieJar()
for i in cookies:
    c.set(i["name"], i["value"])
s.cookies.update(c)  # update the session's cookies
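Since RequestsCookieJar behaves like a dict, the loop above can also be written as a one-line update. Note that this shortcut drops each cookie's domain and path attributes, which is fine for a single-site session like this one; a self-contained sketch:

import requests

# cookies in the shape returned by driver.get_cookies(): a list of dicts
cookies = [{"name": "sessionid", "value": "abc123"}]  # placeholder example

s = requests.session()
# copy the whole list over in one line instead of the explicit loop
s.cookies.update({c["name"]: c["value"] for c in cookies})
print(s.cookies.get("sessionid"))  # -> abc123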
Calculating the follower count and number of pages

1. My follower data is paginated and each request returns at most 45 entries, so first fetch the total follower count and then work out the total number of pages

# send the request
r1 = s.get("https://home.cnblogs.com/u/yoyoketang/relation/followers")
soup = BeautifulSoup(r1.content, "html.parser")

# grab my follower count; the nav text on the page reads 我的粉絲(N)
fensinub = soup.find_all(class_="current_nav")
print fensinub[0].string
num = re.findall(u"我的粉絲\((.+?)\)", fensinub[0].string)
print u"follower count: %s" % str(num[0])

# work out how many pages there are, 45 followers per page
ye = int(int(num[0])/45)+1
print u"total pages: %s" % str(ye)
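One small caveat: int(int(num[0])/45)+1 reports one page too many whenever the follower count is an exact multiple of 45 (the extra request just returns an empty page, so the script still works, only wastefully). Ceiling division gives the exact count; a quick sketch:

import math

num = 90  # example follower count, an exact multiple of 45

print(int(num / 45) + 1)           # -> 3, one page too many
print(int(math.ceil(num / 45.0)))  # -> 2, exact on both Python 2 and 3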
Saving the follower names to txt
# scrape the first page (already parsed into soup above)
fensi = soup.find_all(class_="avatar_name")
for i in fensi:
    name = i.string.replace("\n", "").replace(" ", "")
    print name
    with open("name.txt", "a") as f:  # append to the file
        f.write(name.encode("utf-8")+"\n")

# scrape page 2 onwards
for page in range(2, ye+1):
    r2 = s.get("https://home.cnblogs.com/u/yoyoketang/relation/followers?page=%s" % str(page))
    soup = BeautifulSoup(r2.content, "html.parser")  # the original parsed r1 here by mistake
    fensi = soup.find_all(class_="avatar_name")
    for i in fensi:
        name = i.string.replace("\n", "").replace(" ", "")
        print name
        with open("name.txt", "a") as f:  # append to the file
            f.write(name.encode("utf-8")+"\n")
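One fragility worth knowing about: BeautifulSoup's .string is None whenever a tag has more than one child node, which would crash the .replace() calls above. .get_text() always returns a string, so it is a safer way to read the name; a contrived sketch:

from bs4 import BeautifulSoup

# contrived markup: the extra <b> child makes .string return None
html = '<a class="avatar_name"> some <b>follower</b> </a>'
tag = BeautifulSoup(html, "html.parser").find(class_="avatar_name")

print(tag.string)  # -> None, so .replace() would raise AttributeError
name = tag.get_text().replace("\n", "").replace(" ", "")
print(name)        # -> somefollower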

Full reference code:
# coding:utf-8
import requests
from selenium import webdriver
from bs4 import BeautifulSoup
import re
import time

# firefox profile directory
profile_directory = r'C:\Users\admin\AppData\Roaming\Mozilla\Firefox\Profiles\yn80ouvt.default'

s = requests.session()  # create a session
url = "https://home.cnblogs.com/u/yoyoketang"

def get_cookies(url):
    '''launch selenium and fetch the logged-in cookies'''
    try:
        # load the profile
        profile = webdriver.FirefoxProfile(profile_directory)
        # start the browser with that profile
        driver = webdriver.Firefox(profile)
        driver.get(url + "/followers")
        time.sleep(3)
        cookies = driver.get_cookies()  # get the browser's cookies
        print(cookies)
        driver.quit()
        return cookies
    except Exception as msg:
        print(u"error launching the browser: %s" % str(msg))

def add_cookies(cookies):
    '''add the cookies to the session'''
    try:
        # add the cookies to a CookieJar
        c = requests.cookies.RequestsCookieJar()
        for i in cookies:
            c.set(i["name"], i["value"])
        s.cookies.update(c)  # update the session's cookies
    except Exception as msg:
        print(u"error adding cookies: %s" % str(msg))

def get_ye_nub(url):
    '''get the number of follower pages'''
    try:
        # send the request
        r1 = s.get(url + "/relation/followers")
        soup = BeautifulSoup(r1.content, "html.parser")
        # grab my follower count; the nav text on the page reads 我的粉絲(N)
        fensinub = soup.find_all(class_="current_nav")
        print(fensinub[0].string)
        num = re.findall(u"我的粉絲\((.+?)\)", fensinub[0].string)
        print(u"follower count: %s" % str(num[0]))
        # work out how many pages there are, 45 followers per page
        ye = int(int(num[0]) / 45) + 1
        print(u"total pages: %s" % str(ye))
        return ye
    except Exception as msg:
        print(u"error getting the page count, defaulting to 1: %s" % str(msg))
        return 1

def save_name(nub):
    '''scrape the follower names on one page'''
    try:
        if nub <= 1:
            url_page = url + "/relation/followers"
        else:
            url_page = url + "/relation/followers?page=%s" % str(nub)
        print(u"scraping page: %s" % url_page)
        r2 = s.get(url_page, verify=False)
        soup = BeautifulSoup(r2.content, "html.parser")
        fensi = soup.find_all(class_="avatar_name")
        for i in fensi:
            name = i.string.replace("\n", "").replace(" ", "")
            print(name)
            with open("name.txt", "a") as f:  # append to the file
                f.write(name.encode("utf-8") + "\n")
            # on python3 use these two lines instead:
            # with open("name.txt", "a", encoding="utf-8") as f:  # append to the file
            #     f.write(name + "\n")
    except Exception as msg:
        print(u"error while scraping follower names: %s" % str(msg))

if __name__ == "__main__":
    cookies = get_cookies(url)
    add_cookies(cookies)
    n = get_ye_nub(url)
    for i in list(range(1, n + 1)):
        save_name(i)
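A refinement you may want (not in the original post): launching Firefox on every run is slow, so the selenium cookies can be cached to disk and reused until they expire. A minimal sketch that plugs into the reference code above; the cookies.json file name is arbitrary, and deleting the file forces a fresh browser login:

import json
import os

COOKIE_FILE = "cookies.json"  # arbitrary cache file name

def load_or_fetch_cookies():
    '''reuse cached cookies if present, else launch the browser once'''
    if os.path.exists(COOKIE_FILE):
        with open(COOKIE_FILE) as f:
            return json.load(f)
    cookies = get_cookies(url)  # the selenium helper from the reference code
    with open(COOKIE_FILE, "w") as f:
        json.dump(cookies, f)
    return cookies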
