First, go to the admin page of cnblogs (博客园):
By observing the Ajax requests, we find that the blog's categories are returned as JSON data:
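For reference, the categories response can be parsed with json.loads. This is a minimal sketch; the field name CategoryId appears in the real response, but the sample data below is made up:

```python
import json

# Made-up sample shaped like the categories JSON returned by the Ajax request
sample = '[{"CategoryId": 1, "Title": "Python"}, {"CategoryId": 2, "Title": "Linux"}]'

data = json.loads(sample)                      # parse the JSON text into a list of dicts
ids = [item['CategoryId'] for item in data]    # pull out the category ids
print(ids)                                     # [1, 2]
```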
So I crawl the categories first, then use each category's page to collect the post URLs and view counts. Open a category page:
Inspect the page source:
This gives the URL and view count of each post.
Here is the code; a few other issues are explained in the comments:
import time
import requests
import json
import re
from selenium import webdriver

url = 'https://i.cnblogs.com/categories'
base_url = 'https://i.cnblogs.com/posts?'
views = 0
url_list = []
headers = {
    # Add your own cookie to the headers
    'Cookie': 'your own cookie',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
    'Upgrade-Insecure-Requests': '1',
}

# Note that the view count may be more than a single digit (two or even three
# digits), so the capture group must be (\d{1,}) rather than (\d).
# "发布" ("published") is the literal text on the Chinese admin page.
pattern1 = re.compile(r'<td>发布</td>.*?\d.*?(\d{1,})', re.S)  # regex for the view count
pattern2 = re.compile(r'<td class="post-title"><a href="(.*?)"', re.S)  # regex for the post URL

response = requests.get(url=url, headers=headers)
html = response.text
data = json.loads(html)  # json.loads converts the JSON text into Python objects
categories = (i['CategoryId'] for i in data)

for category in categories:
    cate_url = base_url + 'categoryid=' + str(category)  # build the URL for each category
    headers['Referer'] = cate_url
    response = requests.get(cate_url, headers=headers)
    html = response.text
    results1 = re.findall(pattern1, html)  # all view counts on the page
    results2 = re.findall(pattern2, html)  # all post URLs on the page
    if results1:
        for result1 in results1:
            # int() converts the string-format number to an int so the counts can be summed
            views = views + int(result1)
        for result2 in results2:
            url_list.append('https://' + result2)  # build the full post URL

print('total views are:', views)
print(url_list)

# Simulate a Chrome browser with webdriver from the selenium module
options = webdriver.ChromeOptions()
options.add_argument('lang=zh_CN.UTF-8')  # set the language to Chinese
# Replace the user agent
options.add_argument('user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"')
# https://www.cnblogs.com/francischeng/p/9437809.html
driver = webdriver.Chrome(chrome_options=options)

while True:
    for url in url_list:
        driver.delete_all_cookies()
        driver.get(url)
        time.sleep(2)  # sleep for two seconds
To summarize:
1. The main error I hit was in the regular expression for the view count. At first I wrote (\d), which matches only a single digit. But the view count may have two or three digits, so it should be (\d{1,}) to capture the full number.
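The difference between the two capture groups can be seen on a small snippet. This is a sketch with a made-up table row; the real pattern in the script above also skips over a date column with .*?\d.*? before capturing:

```python
import re

html = '<td>发布</td><td>123</td>'  # made-up row: the post has 123 views

bad = re.compile(r'<td>发布</td>.*?(\d)', re.S)       # captures only the first digit
good = re.compile(r'<td>发布</td>.*?(\d{1,})', re.S)  # captures the whole number

print(bad.findall(html))   # ['1']
print(good.findall(html))  # ['123']
```

(\d{1,}) is equivalent to the more common (\d+); both match one or more digits greedily.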
2. Simply simulating a browser like this does not actually inflate the view count; it may be that one IP can only add a few views per day.