Python crawler: counting blog views and brushing (inflating) view counts

First, open the admin page of cnblogs (博客园):

By observing the Ajax requests, we find that the blog's categories come back as JSON data:
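For illustration, here is a minimal standalone sketch of reading that categories endpoint. It assumes, as the full script below does, that the response is a JSON array of objects carrying a CategoryId field; the cookie value is a placeholder you must fill in from your own logged-in session:

import json
import requests

headers = {
    'Cookie': 'Your own cookie',  # placeholder: copy this from your logged-in browser session
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
}
response = requests.get('https://i.cnblogs.com/categories', headers=headers)
data = json.loads(response.text)  # assumed shape: [{"CategoryId": 123, ...}, ...]
print([item['CategoryId'] for item in data])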

So I crawled the categories first, then went through each category's page to scrape the post URLs and view counts. Open a category page:

Inspect the page source:

This gives you the URL and view count of each blog post.
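To make the extraction concrete, here is a small sketch that runs the two regexes from the script below against a made-up fragment of a category page (the real markup may differ; the cell holding the digit 3 is invented purely to show why the pattern skips one digit before capturing):

import re

html = ('<td class="post-title"><a href="www.cnblogs.com/francischeng/p/9437809.html">a post</a></td>'
        '<td>发布</td><td>3</td><td>128</td>')

pattern1 = re.compile(r'<td>发布</td>.*?\d.*?(\d{1,})', re.S)  # view count
pattern2 = re.compile(r'<td class="post-title"><a href="(.*?)"', re.S)  # post URL

print(re.findall(pattern1, html))  # ['128'] -- skips past the first digit, then captures the count
print(re.findall(pattern2, html))  # ['www.cnblogs.com/francischeng/p/9437809.html']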

Here is the code; a few other gotchas are noted in the comments:

import time
import json
import re

import requests
from selenium import webdriver

url = 'https://i.cnblogs.com/categories'
base_url = 'https://i.cnblogs.com/posts?'
views = 0
url_list = []
headers = {
    # add your own cookie to the headers
    'Cookie': 'Your own cookie',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
    'Upgrade-Insecure-Requests': '1',
}

# Note in particular that the view count may be more than one digit (two or
# even three digits), so the capture group must be (\d{1,}), not (\d).
pattern1 = re.compile(r'<td>发布</td>.*?\d.*?(\d{1,})', re.S)  # regex for the view count
pattern2 = re.compile(r'<td class="post-title"><a href="(.*?)"', re.S)  # regex for the post URL

response = requests.get(url=url, headers=headers)
html = response.text
data = json.loads(html)  # parse the JSON response with json.loads
categories = (i['CategoryId'] for i in data)

for category in categories:
    cate_url = base_url + 'categoryid=' + str(category)  # build the URL for each category
    headers['Referer'] = cate_url
    response = requests.get(cate_url, headers=headers)
    html = response.text
    results1 = re.findall(pattern1, html)  # findall results for the view counts
    results2 = re.findall(pattern2, html)  # findall results for the post URLs
    if results1:
        for result1 in results1:
            # the built-in int() converts the digits from string format to a
            # numeric value so the views can be summed
            views = views + int(result1)
        for result2 in results2:
            # build the full address, e.g.
            # https://www.cnblogs.com/francischeng/p/9437809.html
            url_list.append('https://' + result2)

print('Total views are:', views)
print(url_list)

# simulate a Chrome browser with webdriver from the selenium module
options = webdriver.ChromeOptions()
options.add_argument('lang=zh_CN.UTF-8')  # set the language to Chinese
# replace the user agent
options.add_argument('user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                     'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 '
                     'Safari/537.36"')
driver = webdriver.Chrome(chrome_options=options)

while True:
    for url in url_list:
        driver.delete_all_cookies()  # clear cookies so each visit looks fresh
        driver.get(url)
        time.sleep(2)  # sleep two seconds
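As written, the brushing loop above runs forever. For testing, a bounded variant may be easier to control; this sketch reuses driver, url_list, and time from the script above, and the ROUNDS constant is my own addition, not part of the original:

ROUNDS = 3  # hypothetical: visit every post three times, then stop

for _ in range(ROUNDS):
    for url in url_list:
        driver.delete_all_cookies()  # drop cookies so each visit starts a fresh session
        driver.get(url)
        time.sleep(2)

driver.quit()  # close the browser once the rounds are done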

Summary:

1. The main error I hit was in the regular expression for the view count. At first I wrote (\d), which captures only a single digit; since the view count may be two or three digits, it should be (\d{1,}) so the whole number is captured.
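A quick standalone check of the difference (the sample string is made up):

import re

s = 'views: 256'
print(re.findall(r'(\d)', s))      # ['2', '5', '6'] -- one digit per match
print(re.findall(r'(\d{1,})', s))  # ['256'] -- the whole number in one match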

2. Simply simulating a browser does not reliably brush the view count; it seems one IP can only add a few views per day.
