First, go to the admin page of cnblogs (博客园):
By observing the Ajax requests, we find that the blog's categories are returned as JSON data:
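For reference, the categories response can be parsed with json.loads. This is a minimal sketch; the field name CategoryId appears in the real response, but the sample data below is made up:

```python
import json

# Made-up sample shaped like the categories JSON returned by the Ajax request
sample = '[{"CategoryId": 1, "Title": "Python"}, {"CategoryId": 2, "Title": "Linux"}]'

data = json.loads(sample)                      # parse the JSON text into a list of dicts
ids = [item['CategoryId'] for item in data]    # pull out the category ids
print(ids)                                     # [1, 2]
```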
So I crawl the categories first, then use each category's page to collect the post URLs and view counts. Open a category page:
Inspect the page source:
This gives the URL and view count of each post.
Here is the code; a few other issues are explained in the comments:
import time
import requests
import json
import re
from selenium import webdriver

url = 'https://i.cnblogs.com/categories'
base_url = 'https://i.cnblogs.com/posts?'
views = 0
url_list = []
headers = {
    # Add your own cookie to the headers
    'Cookie': 'your own cookie',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
    'Upgrade-Insecure-Requests': '1',
}

# Note that the view count may be more than a single digit (two or even three
# digits), so the capture group must be (\d{1,}) rather than (\d).
# "发布" ("published") is the literal text on the Chinese admin page.
pattern1 = re.compile(r'<td>发布</td>.*?\d.*?(\d{1,})', re.S)  # regex for the view count
pattern2 = re.compile(r'<td class="post-title"><a href="(.*?)"', re.S)  # regex for the post URL

response = requests.get(url=url, headers=headers)
html = response.text
data = json.loads(html)  # json.loads converts the JSON text into Python objects
categories = (i['CategoryId'] for i in data)

for category in categories:
    cate_url = base_url + 'categoryid=' + str(category)  # build the URL for each category
    headers['Referer'] = cate_url
    response = requests.get(cate_url, headers=headers)
    html = response.text
    results1 = re.findall(pattern1, html)  # all view counts on the page
    results2 = re.findall(pattern2, html)  # all post URLs on the page
    if results1:
        for result1 in results1:
            # int() converts the string-format number to an int so the counts can be summed
            views = views + int(result1)
        for result2 in results2:
            url_list.append('https://' + result2)  # build the full post URL

print('total views are:', views)
print(url_list)

# Simulate a Chrome browser with webdriver from the selenium module
options = webdriver.ChromeOptions()
options.add_argument('lang=zh_CN.UTF-8')  # set the language to Chinese
# Replace the user agent
options.add_argument('user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"')
# https://www.cnblogs.com/francischeng/p/9437809.html
driver = webdriver.Chrome(chrome_options=options)

while True:
    for url in url_list:
        driver.delete_all_cookies()
        driver.get(url)
        time.sleep(2)  # sleep for two seconds
To summarize:
1. The main error I hit was in the regular expression for the view count. At first I wrote (\d), which matches only a single digit. But the view count may have two or three digits, so it should be (\d{1,}) to capture the full number.
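The difference between the two capture groups can be seen on a small snippet. This is a sketch with a made-up table row; the real pattern in the script above also skips over a date column with .*?\d.*? before capturing:

```python
import re

html = '<td>发布</td><td>123</td>'  # made-up row: the post has 123 views

bad = re.compile(r'<td>发布</td>.*?(\d)', re.S)       # captures only the first digit
good = re.compile(r'<td>发布</td>.*?(\d{1,})', re.S)  # captures the whole number

print(bad.findall(html))   # ['1']
print(good.findall(html))  # ['123']
```

(\d{1,}) is equivalent to the more common (\d+); both match one or more digits greedily.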
2. Simply simulating a browser like this does not actually inflate the view count; it may be that one IP can only add a few views per day.