Python + selenium + requests: crawl my blog followers' names


Crawl target

1. This code runs on Python 2; to run it on Python 3 only 2 lines of code need to change. The following Python modules are used:

    • Selenium 2.53.6 + Firefox 44
    • BeautifulSoup
    • Requests

2. Target site: my blog, https://home.cnblogs.com/u/yoyoketang
Crawl content: all of my blog followers' names, saved to a TXT file

3. Because logging in to cnblogs (Blog Park) requires human verification, it is not possible to log in directly with an account and password from requests; selenium is used to log in instead.

Getting cookies with selenium

1. Prerequisite: manually open the browser first, log in to my blog, and remember the password
(make sure that after the browser is closed, the next time you open it and visit my blog you are still logged in).
2. By default selenium starts the browser with an empty profile and does not load the cached configuration, so first find the browser's profile directory; Firefox is used as the example here.
3. Use the driver.get_cookies() method to obtain the browser's cookies.

# coding:utf-8
import requests
from selenium import webdriver
from bs4 import BeautifulSoup
import re
import time

# Firefox browser profile directory
profile_directory = r'C:\Users\admin\AppData\Roaming\Mozilla\Firefox\Profiles\yn80ouvt.default'

# Load the profile
profile = webdriver.FirefoxProfile(profile_directory)

# Start the browser with this profile
driver = webdriver.Firefox(profile)
driver.get("https://home.cnblogs.com/u/yoyoketang/followers/")
time.sleep(3)

cookies = driver.get_cookies()  # Get the browser cookies
print(cookies)
driver.quit()
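For reference, driver.get_cookies() returns a list of dictionaries, one per cookie. A minimal sketch of the shape (the cookie name and value below are made up for illustration):

cookies = [
    {"name": ".CNBlogsCookie", "value": "0A1B2C...", "domain": ".cnblogs.com",
     "path": "/", "secure": False, "httpOnly": True},
    # ... more cookies
]

Only the "name" and "value" fields are needed in the next step.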

(Note: if the browser launched by this script opens the blog page and it is not logged in, there is no point continuing; first check whether the profile path is written correctly.)
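If you are not sure where the profile directory is, here is a small sketch (not from the original article; it assumes a default Firefox install on Windows) that lists candidate profile folders:

import glob
import os

# Firefox profiles normally live under %APPDATA%\Mozilla\Firefox\Profiles on Windows
profiles_root = os.path.join(os.environ["APPDATA"], "Mozilla", "Firefox", "Profiles")
for folder in glob.glob(os.path.join(profiles_root, "*.default*")):
    print(folder)  # use the folder that belongs to the logged-in profile

The about:support page in Firefox also shows the profile folder.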

Adding the login cookies to requests

1. After the browser's cookies have been obtained, the next step is to create a session with requests and add the cookies from the successful login to that session.

s = requests.session()  # Create a new session

# Add the cookies to a CookieJar
c = requests.cookies.RequestsCookieJar()
for i in cookies:
    c.set(i["name"], i["value"])

s.cookies.update(c)  # Update the cookies in the session
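As a small alternative (not from the original article), the session cookies can also be updated directly from a plain name/value mapping, since the session's cookie jar accepts a dict:

s = requests.session()
# cookies is the list returned by driver.get_cookies()
s.cookies.update({c["name"]: c["value"] for c in cookies})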
Count the number of followers and the total number of pages

1. Since the follower data is paginated and each request returns at most 45 followers, first get the total number of followers and then calculate the total number of pages.

# Send the request
r1 = s.get("https://home.cnblogs.com/u/yoyoketang/relation/followers")
soup = BeautifulSoup(r1.content, "html.parser")

# Grab my follower count (the counter on the page reads 我的粉丝(N))
fensinub = soup.find_all(class_="current_nav")
print fensinub[0].string
num = re.findall(u"我的粉丝\((.+?)\)", fensinub[0].string)
print u"My follower count: %s" % str(num[0])

# Calculate how many pages, 45 followers per page
ye = int(int(num[0]) / 45) + 1
print u"Total pages: %s" % str(ye)
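A side note on the page arithmetic: int(num / 45) + 1 requests one extra, empty page whenever the follower count is an exact multiple of 45. If the exact page count matters, ceiling division avoids that; a small sketch:

# Ceiling division: pages needed for num followers, 45 per page
num = 128
ye = (num + 45 - 1) // 45
print(ye)  # 128 -> 3; 90 -> 2 (the original formula would give 3)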
Save the follower names to TXT
# Grab the first page of data
fensi = soup.find_all(class_="avatar_name")
for i in fensi:
    name = i.string.replace("\n", "").replace(" ", "")
    print name
    with open("name.txt", "a") as f:  # Append write
        f.write(name.encode("utf-8") + "\n")

# Grab the data from the second page onwards
for i in range(2, ye + 1):
    r2 = s.get("https://home.cnblogs.com/u/yoyoketang/relation/followers?page=%s" % str(i))
    soup = BeautifulSoup(r2.content, "html.parser")
    # Grab the follower names on this page
    fensi = soup.find_all(class_="avatar_name")
    for i in fensi:
        name = i.string.replace("\n", "").replace(" ", "")
        print name
        with open("name.txt", "a") as f:  # Append write
            f.write(name.encode("utf-8") + "\n")
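A small variation (not from the original article): instead of reopening name.txt for every follower, the names can be collected first and written once at the end. A sketch that assumes the same s, ye and BeautifulSoup setup as above:

names = []
for page in range(1, ye + 1):
    if page <= 1:
        url_page = "https://home.cnblogs.com/u/yoyoketang/relation/followers"
    else:
        url_page = "https://home.cnblogs.com/u/yoyoketang/relation/followers?page=%s" % page
    r = s.get(url_page)
    soup = BeautifulSoup(r.content, "html.parser")
    for tag in soup.find_all(class_="avatar_name"):
        names.append(tag.string.replace("\n", "").replace(" ", ""))

with open("name.txt", "a") as f:  # Python 2; on Python 3 add encoding="utf-8" and drop .encode()
    for name in names:
        f.write(name.encode("utf-8") + "\n")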

Reference code:
# coding:utf-8
import requests
from selenium import webdriver
from bs4 import BeautifulSoup
import re
import time

# Firefox browser profile directory
profile_directory = r'C:\Users\admin\AppData\Roaming\Mozilla\Firefox\Profiles\yn80ouvt.default'

s = requests.session()  # Create a new session
url = "https://home.cnblogs.com/u/yoyoketang"


def get_cookies(url):
    '''Start selenium and get the login cookies'''
    try:
        # Load the profile
        profile = webdriver.FirefoxProfile(profile_directory)
        # Start the browser with this profile
        driver = webdriver.Firefox(profile)
        driver.get(url + "/followers")
        time.sleep(3)
        cookies = driver.get_cookies()  # Get the browser cookies
        print(cookies)
        driver.quit()
        return cookies
    except Exception as msg:
        print(u"Error starting the browser: %s" % str(msg))


def add_cookies(cookies):
    '''Add the cookies to the session'''
    try:
        # Add the cookies to a CookieJar
        c = requests.cookies.RequestsCookieJar()
        for i in cookies:
            c.set(i["name"], i["value"])
        s.cookies.update(c)  # Update the cookies in the session
    except Exception as msg:
        print(u"Error adding cookies: %s" % str(msg))


def get_ye_nub(url):
    '''Get the number of follower pages'''
    try:
        # Send the request
        r1 = s.get(url + "/relation/followers")
        soup = BeautifulSoup(r1.content, "html.parser")
        # Grab my follower count (the counter on the page reads 我的粉丝(N))
        fensinub = soup.find_all(class_="current_nav")
        print(fensinub[0].string)
        num = re.findall(u"我的粉丝\((.+?)\)", fensinub[0].string)
        print(u"My follower count: %s" % str(num[0]))
        # Calculate how many pages, 45 followers per page
        ye = int(int(num[0]) / 45) + 1
        print(u"Total pages: %s" % str(ye))
        return ye
    except Exception as msg:
        print(u"Error getting the number of follower pages, returning 1 by default: %s" % str(msg))
        return 1


def save_name(nub):
    '''Grab the follower names on one page'''
    try:
        # Build the URL for this page
        if nub <= 1:
            url_page = url + "/relation/followers"
        else:
            url_page = url + "/relation/followers?page=%s" % str(nub)
        print(u"Crawling page: %s" % url_page)
        r2 = s.get(url_page, verify=False)
        soup = BeautifulSoup(r2.content, "html.parser")
        fensi = soup.find_all(class_="avatar_name")
        for i in fensi:
            name = i.string.replace("\n", "").replace(" ", "")
            print(name)
            with open("name.txt", "a") as f:  # Append write
                f.write(name.encode("utf-8") + "\n")
            # On Python 3 change the two lines above to:
            # with open("name.txt", "a", encoding="utf-8") as f:  # Append write
            #     f.write(name + "\n")
    except Exception as msg:
        print(u"Error while grabbing follower names: %s" % str(msg))


if __name__ == "__main__":
    cookies = get_cookies(url)
    add_cookies(cookies)
    n = get_ye_nub(url)
    for i in list(range(1, n + 1)):
        save_name(i)
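One practical note on the reference code: s.get(url_page, verify=False) will usually make urllib3 emit an InsecureRequestWarning for every request. If the output gets noisy, the warning can be silenced (how exactly depends on the installed requests/urllib3 version), for example:

import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)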
