Python + selenium + requests: crawl my blog followers' names


Crawl target

1. This code runs on Python 2; to run it on Python 3 only 2 lines of code need to change. The following Python modules are used:

    • Selenium 2.53.6 + Firefox 44
    • BeautifulSoup
    • Requests

2. Target site: my blog, https://home.cnblogs.com/u/yoyoketang
Crawl content: all of my blog followers' names, saved to a TXT file

3. Because logging in to cnblogs (Blog Park) requires human verification, it is not possible to log in directly with an account and password from requests; selenium is used to log in instead.

Getting cookies with selenium

1. Prerequisite: manually open the browser first, log in to my blog, and remember the password
(make sure that after the browser is closed, the next time you open it and visit my blog you are still logged in).
2. By default selenium starts the browser with an empty profile and does not load the cached configuration, so first find the browser's profile directory; Firefox is used as the example here.
3. Use the driver.get_cookies() method to obtain the browser's cookies.

# coding:utf-8
import requests
from selenium import webdriver
from bs4 import BeautifulSoup
import re
import time

# Firefox browser profile directory
profile_directory = r'C:\Users\admin\AppData\Roaming\Mozilla\Firefox\Profiles\yn80ouvt.default'

# Load the profile
profile = webdriver.FirefoxProfile(profile_directory)

# Start the browser with this profile
driver = webdriver.Firefox(profile)
driver.get("https://home.cnblogs.com/u/yoyoketang/followers/")
time.sleep(3)

cookies = driver.get_cookies()  # Get the browser cookies
print(cookies)
driver.quit()
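For reference, driver.get_cookies() returns a list of dictionaries, one per cookie. A minimal sketch of the shape (the cookie name and value below are made up for illustration):

cookies = [
    {"name": ".CNBlogsCookie", "value": "0A1B2C...", "domain": ".cnblogs.com",
     "path": "/", "secure": False, "httpOnly": True},
    # ... more cookies
]

Only the "name" and "value" fields are needed in the next step.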

(Note: if the browser launched by this script opens the blog page and it is not logged in, there is no point continuing; first check whether the profile path is written correctly.)
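If you are not sure where the profile directory is, here is a small sketch (not from the original article; it assumes a default Firefox install on Windows) that lists candidate profile folders:

import glob
import os

# Firefox profiles normally live under %APPDATA%\Mozilla\Firefox\Profiles on Windows
profiles_root = os.path.join(os.environ["APPDATA"], "Mozilla", "Firefox", "Profiles")
for folder in glob.glob(os.path.join(profiles_root, "*.default*")):
    print(folder)  # use the folder that belongs to the logged-in profile

The about:support page in Firefox also shows the profile folder.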

Adding the login cookies to requests

1. After the browser's cookies have been obtained, the next step is to create a session with requests and add the cookies from the successful login to that session.

s = requests.session()  # Create a new session

# Add the cookies to a CookieJar
c = requests.cookies.RequestsCookieJar()
for i in cookies:
    c.set(i["name"], i["value"])

s.cookies.update(c)  # Update the cookies in the session
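As a small alternative (not from the original article), the session cookies can also be updated directly from a plain name/value mapping, since the session's cookie jar accepts a dict:

s = requests.session()
# cookies is the list returned by driver.get_cookies()
s.cookies.update({c["name"]: c["value"] for c in cookies})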
Count the number of followers and the total number of pages

1. Since the follower data is paginated and each request returns at most 45 followers, first get the total number of followers and then calculate the total number of pages.

# Send the request
r1 = s.get("https://home.cnblogs.com/u/yoyoketang/relation/followers")
soup = BeautifulSoup(r1.content, "html.parser")

# Grab my follower count (the counter on the page reads 我的粉丝(N))
fensinub = soup.find_all(class_="current_nav")
print fensinub[0].string
num = re.findall(u"我的粉丝\((.+?)\)", fensinub[0].string)
print u"My follower count: %s" % str(num[0])

# Calculate how many pages, 45 followers per page
ye = int(int(num[0]) / 45) + 1
print u"Total pages: %s" % str(ye)
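A side note on the page arithmetic: int(num / 45) + 1 requests one extra, empty page whenever the follower count is an exact multiple of 45. If the exact page count matters, ceiling division avoids that; a small sketch:

# Ceiling division: pages needed for num followers, 45 per page
num = 128
ye = (num + 45 - 1) // 45
print(ye)  # 128 -> 3; 90 -> 2 (the original formula would give 3)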
Save the follower names to TXT
# Grab the first page of data
fensi = soup.find_all(class_="avatar_name")
for i in fensi:
    name = i.string.replace("\n", "").replace(" ", "")
    print name
    with open("name.txt", "a") as f:  # Append write
        f.write(name.encode("utf-8") + "\n")

# Grab the data from the second page onwards
for i in range(2, ye + 1):
    r2 = s.get("https://home.cnblogs.com/u/yoyoketang/relation/followers?page=%s" % str(i))
    soup = BeautifulSoup(r2.content, "html.parser")
    # Grab the follower names on this page
    fensi = soup.find_all(class_="avatar_name")
    for i in fensi:
        name = i.string.replace("\n", "").replace(" ", "")
        print name
        with open("name.txt", "a") as f:  # Append write
            f.write(name.encode("utf-8") + "\n")
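A small variation (not from the original article): instead of reopening name.txt for every follower, the names can be collected first and written once at the end. A sketch that assumes the same s, ye and BeautifulSoup setup as above:

names = []
for page in range(1, ye + 1):
    if page <= 1:
        url_page = "https://home.cnblogs.com/u/yoyoketang/relation/followers"
    else:
        url_page = "https://home.cnblogs.com/u/yoyoketang/relation/followers?page=%s" % page
    r = s.get(url_page)
    soup = BeautifulSoup(r.content, "html.parser")
    for tag in soup.find_all(class_="avatar_name"):
        names.append(tag.string.replace("\n", "").replace(" ", ""))

with open("name.txt", "a") as f:  # Python 2; on Python 3 add encoding="utf-8" and drop .encode()
    for name in names:
        f.write(name.encode("utf-8") + "\n")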

Reference code:
# coding:utf-8
import requests
from selenium import webdriver
from bs4 import BeautifulSoup
import re
import time

# Firefox browser profile directory
profile_directory = r'C:\Users\admin\AppData\Roaming\Mozilla\Firefox\Profiles\yn80ouvt.default'

s = requests.session()  # Create a new session
url = "https://home.cnblogs.com/u/yoyoketang"


def get_cookies(url):
    '''Start selenium and get the login cookies'''
    try:
        # Load the profile
        profile = webdriver.FirefoxProfile(profile_directory)
        # Start the browser with this profile
        driver = webdriver.Firefox(profile)
        driver.get(url + "/followers")
        time.sleep(3)
        cookies = driver.get_cookies()  # Get the browser cookies
        print(cookies)
        driver.quit()
        return cookies
    except Exception as msg:
        print(u"Error starting the browser: %s" % str(msg))


def add_cookies(cookies):
    '''Add the cookies to the session'''
    try:
        # Add the cookies to a CookieJar
        c = requests.cookies.RequestsCookieJar()
        for i in cookies:
            c.set(i["name"], i["value"])
        s.cookies.update(c)  # Update the cookies in the session
    except Exception as msg:
        print(u"Error adding cookies: %s" % str(msg))


def get_ye_nub(url):
    '''Get the number of follower pages'''
    try:
        # Send the request
        r1 = s.get(url + "/relation/followers")
        soup = BeautifulSoup(r1.content, "html.parser")
        # Grab my follower count (the counter on the page reads 我的粉丝(N))
        fensinub = soup.find_all(class_="current_nav")
        print(fensinub[0].string)
        num = re.findall(u"我的粉丝\((.+?)\)", fensinub[0].string)
        print(u"My follower count: %s" % str(num[0]))
        # Calculate how many pages, 45 followers per page
        ye = int(int(num[0]) / 45) + 1
        print(u"Total pages: %s" % str(ye))
        return ye
    except Exception as msg:
        print(u"Error getting the number of follower pages, returning 1 by default: %s" % str(msg))
        return 1


def save_name(nub):
    '''Grab the follower names on one page'''
    try:
        # Build the URL for this page
        if nub <= 1:
            url_page = url + "/relation/followers"
        else:
            url_page = url + "/relation/followers?page=%s" % str(nub)
        print(u"Crawling page: %s" % url_page)
        r2 = s.get(url_page, verify=False)
        soup = BeautifulSoup(r2.content, "html.parser")
        fensi = soup.find_all(class_="avatar_name")
        for i in fensi:
            name = i.string.replace("\n", "").replace(" ", "")
            print(name)
            with open("name.txt", "a") as f:  # Append write
                f.write(name.encode("utf-8") + "\n")
            # On Python 3 change the two lines above to:
            # with open("name.txt", "a", encoding="utf-8") as f:  # Append write
            #     f.write(name + "\n")
    except Exception as msg:
        print(u"Error while grabbing follower names: %s" % str(msg))


if __name__ == "__main__":
    cookies = get_cookies(url)
    add_cookies(cookies)
    n = get_ye_nub(url)
    for i in list(range(1, n + 1)):
        save_name(i)
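One practical note on the reference code: s.get(url_page, verify=False) will usually make urllib3 emit an InsecureRequestWarning for every request. If the output gets noisy, the warning can be silenced (how exactly depends on the installed requests/urllib3 version), for example:

import urllib3
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)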
