The First Python Crawler
1. Install the Python Environment
Download the installer matching your operating system from the official website (https://www.python.org/), install it, and configure the environment variables.
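Once the environment variables are configured, you can verify the installation from any script; a minimal check:

```python
import sys

# Print the interpreter version to confirm the Python environment is on the PATH.
print(sys.version)
```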
2. Install the Python plug-in in IntelliJ IDEA
In IDEA, I searched for the Python plug-in in the plug-in settings and installed it directly; detailed walkthroughs are easy to find with a web search.
3. Install the BeautifulSoup library
Documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#attributes
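The crawler below relies on two BeautifulSoup calls: `find_all` to collect the item containers and `find` with the `class_` keyword to drill into each one. A minimal sketch against a hypothetical HTML fragment (the class names mirror the ones used later; the fragment itself is invented for illustration):

```python
import bs4

# A made-up fragment shaped like one item of the ing.cnblogs.com feed.
html = """
<div class="ing-item">
  <img src="http://example.com/avatar.png"/>
  <div class="feed_body">
    <a class="ing-author">alice</a>
    <span class="ing_body">hello flash</span>
  </div>
</div>
"""

soup = bs4.BeautifulSoup(html, "html.parser")
# find_all returns every <div> whose class attribute contains "ing-item".
for div in soup.find_all("div", class_="ing-item"):
    avatar = div.find("img")["src"]                       # attribute access
    author = div.find("a", class_="ing-author").text      # tag text
    body = div.find("span", class_="ing_body").text
    print(avatar, author, body)
```

Note that the keyword is `class_` (with a trailing underscore), because `class` is a reserved word in Python.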
4. Crawler: the Flash (ing) feed of the Blog Garden (cnblogs)
```python
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import urllib2
import time
import bs4

'''ing.cnblogs.com crawling class'''
class CnBlogsSpider:
    url = "https://ing.cnblogs.com/ajax/ing/GetIngList?IngListType=All&PageIndex=${pageNo}&PageSize=30&Tag=&_="

    # fetch the html
    def getHtml(self):
        request = urllib2.Request(self.pageUrl)
        response = urllib2.urlopen(request)
        self.html = response.read()

    # parse the html
    def analyze(self):
        self.getHtml()
        bSoup = bs4.BeautifulSoup(self.html, "html.parser")
        divs = bSoup.find_all("div", class_='ing-item')
        for div in divs:
            img = div.find("img")['src']
            item = div.find("div", class_='feed_body')
            userName = item.find("a", class_='ing-author').text
            text = item.find("span", class_='ing_body').text
            pubtime = item.find("a", class_='ing_time').text
            star = item.find("img", class_='ing-icon') and True or False
            print '(avatar:', img, ', nickname:', userName, ', flash:', text, ', time:', pubtime, ', star:', star, ')'

    def run(self, page):
        pageNo = 1
        while (pageNo <= page):
            self.pageUrl = self.url.replace('${pageNo}', str(pageNo)) + str(int(time.time()))
            print '-------------\r\n', pageNo, 'page data follows:', self.pageUrl
            self.analyze()
            pageNo = pageNo + 1

CnBlogsSpider().run(3)
```
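One subtle line in `run()` is the URL construction: the page number fills the `${pageNo}` placeholder, and the current Unix timestamp is appended to the trailing `_=` parameter as a cache buster. Isolated as a standalone helper (the function name is mine, not from the original script):

```python
import time

def build_page_url(template, page_no):
    """Fill the ${pageNo} placeholder and append a cache-busting
    Unix timestamp, mirroring the replace() call in run()."""
    return template.replace('${pageNo}', str(page_no)) + str(int(time.time()))

URL = ("https://ing.cnblogs.com/ajax/ing/GetIngList?"
       "IngListType=All&PageIndex=${pageNo}&PageSize=30&Tag=&_=")
print(build_page_url(URL, 1))
```

Appending a fresh timestamp on every request makes each URL unique, so intermediate caches cannot serve a stale copy of the AJAX response.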
5. Execution result