Python Crawler, Part 1: Scraping Jokes from Qiushibaike (the "Embarrassing Encyclopedia")

Last Update:2017-12-09 Source: Internet

Author: User

Reference: http://cuiqingcai.com/990.html

1. Procedural (non-object-oriented) version

Full Code 1:

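# -*- coding: utf-8 -*-
# Python 2 code (urllib2 and the print statement).
import re
import urllib2

page = 1
url = 'http://www.qiushibaike.com/hot/page/' + str(page)
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
try:
    # Build the request with a browser-like User-Agent header and fetch the page
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    content = response.read().decode('utf-8')
    # NOTE: the middle of this pattern is missing from the source. Only the
    # first and last fragments survive, so the groups behind item[0] to
    # item[2] are not shown and the indexing below will not line up until
    # they are restored.
    pattern = re.compile(
        '<div.*?class="author.*?'
        # [missing from the source: the remaining fragments and groups]
        '<span.*?class="dash">.*?<i.*?class="number">(.*?)</i>', re.S)
    items = re.findall(pattern, content)
    for item in items:
        print item[0], item[1], item[2], item[3]
except urllib2.URLError as e:
    # An HTTP error carries a status code; every URLError carries a reason
    if hasattr(e, "code"):
        print e.code
    if hasattr(e, "reason"):
        print e.reason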
[Figure: output of running Full Code 1]

Note 1: Qiushibaike does not require a login, so there is no need to handle cookies.
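The two steps in Full Code 1 (send a request with a User-Agent header, then pull fields out of the HTML with one regular expression) can be tried without touching the network. The following is only a sketch in Python 3, not the original author's code: the names `build_request`, `fetch`, `extract`, `PATTERN` and `SAMPLE_HTML` are hypothetical, the sample markup is made up, and the `<h2>` and content `<span>` fragments of the pattern are assumptions, since the real page structure is not shown here.

```python
import re
import urllib.error
import urllib.request

USER_AGENT = "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"


def build_request(page):
    """Build the request for one page of the hot list, with a User-Agent header."""
    url = "http://www.qiushibaike.com/hot/page/" + str(page)
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch(page):
    """Return the decoded page, or None if the request fails."""
    try:
        with urllib.request.urlopen(build_request(page)) as response:
            return response.read().decode("utf-8")
    except urllib.error.URLError as e:
        # HTTPError (a subclass of URLError) also carries a status code
        if hasattr(e, "code"):
            print(e.code)
        print(e.reason)
        return None


# ASSUMED markup, for illustration only; the real site may differ.
SAMPLE_HTML = """
<div class="author clearfix">
<h2>Alice</h2>
</div>
<div class="content">
<span>First line<br/>second line</span>
</div>
<div class="stats">
<span class="stats-vote"><i class="number">120</i> likes</span>
<span class="dash"><i class="number">7</i> comments</span>
</div>
"""

# Four groups: publisher, content, likes, comments.
# The first two fragments are assumptions; the last two follow the listing.
PATTERN = re.compile(
    r'<div.*?class="author.*?<h2>(.*?)</h2>.*?'
    r'<div.*?class="content".*?<span>(.*?)</span>.*?'
    r'<span.*?class="stats-vote">.*?<i.*?class="number">(.*?)</i>.*?'
    r'<span.*?class="dash">.*?<i.*?class="number">(.*?)</i>',
    re.S,  # let "." match newlines, since each entry spans several lines
)


def extract(html):
    """Return one (publisher, content, likes, comments) tuple per match."""
    return [tuple(field.strip() for field in item)
            for item in PATTERN.findall(html)]


if __name__ == "__main__":
    for item in extract(SAMPLE_HTML):
        print(item)
```

Because the pattern has several groups, `findall` returns one tuple per match, which is why the listing indexes each result as `item[0]`, `item[1]`, and so on.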

2. Object-oriented version

The code above is the core of the crawler. Here is what we want to achieve next:

Each time the user presses Enter, display one joke together with its publisher, its content, the number of likes and the number of comments.

In addition, we restructure the program in an object-oriented style, introducing a class and methods to encapsulate and tidy up the code.
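To make the display requirement concrete, here is a minimal sketch in Python 3 of how one joke might be cleaned and formatted. The helper names `clean_text` and `format_story` are hypothetical, and the story is assumed to be a four-element list of publisher, content, likes and comments.

```python
def clean_text(raw):
    """Turn <br/> tags into newlines and trim surrounding whitespace."""
    return raw.replace("<br/>", "\n").strip()


def format_story(page, story):
    """Format one joke: page number, publisher, likes, comments, then the text."""
    publisher, content, likes, comments = story
    return "Page %d\tPublisher: %s\tLikes: %s\tComments: %s\n%s" % (
        page, publisher, likes, comments, content)


if __name__ == "__main__":
    story = ["Alice", clean_text(" First line<br/>second line "), "120", "7"]
    print(format_story(1, story))
```

Keeping the formatting in its own function means the input loop only has to decide when to print, not how.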

Full Code 2:

# -*- coding: utf-8 -*-
# Python 2 code (urllib2, raw_input and the print statement).
import re
import urllib2


# Qiushibaike crawler
class QSBK:

    # Initialization: define some instance variables
    def __init__(self):
        self.pageIndex = 1
        self.user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        # Initialize the request headers
        self.headers = {'User-Agent': self.user_agent}
        # Buffer of fetched jokes; each element holds the jokes from one page
        self.stories = []
        # Flag that keeps the program running
        self.enable = False

    # Fetch the HTML of the page with the given index
    def getPage(self, pageIndex):
        try:
            url = 'http://www.qiushibaike.com/hot/page/' + str(pageIndex)
            # Build the request
            request = urllib2.Request(url, headers=self.headers)
            # Fetch the page with urlopen
            response = urllib2.urlopen(request)
            # Decode the page as UTF-8
            pageCode = response.read().decode('utf-8')
            return pageCode
        except urllib2.URLError as e:
            if hasattr(e, "reason"):
                print u"Failed to connect to Qiushibaike. Reason:", e.reason
                return None

    # Fetch one page and return the list of jokes on it
    def getPageItems(self, pageIndex):
        pageCode = self.getPage(pageIndex)
        if not pageCode:
            print "Page load failed ...."
            return None
        # NOTE: the middle of this pattern is missing from the source. It
        # held the groups for the publisher (item[0]) and the joke text
        # (item[1]); until they are restored, the item[...] indexing below
        # will not line up.
        pattern = re.compile(
            '<div.*?class="author.*?'
            # [missing from the source: groups for the publisher and the text]
            '<span.*?class="stats-vote">.*?<i.*?class="number">(.*?)</i>.*?'
            '<span.*?class="dash">.*?<i.*?class="number">(.*?)</i>', re.S)
        items = re.findall(pattern, pageCode)
        # Jokes found on this page
        pageStories = []
        # Walk through the regular-expression matches
        for item in items:
            replaceBR = re.compile('<br/>')
            text = re.sub(replaceBR, "\n", item[1])
            # item[0] is the publisher, item[1] the content,
            # item[2] the number of likes, item[3] the number of comments
            pageStories.append([item[0].strip(), text.strip(),
                                item[2].strip(), item[3].strip()])
        return pageStories

    # Load a page, extract its jokes and add them to the buffer
    def loadPage(self):
        # Load a new page if fewer than 2 unread pages are buffered
        if self.enable:
            if len(self.stories) < 2:
                # Fetch a new page
                pageStories = self.getPageItems(self.pageIndex)
                # Store the page's jokes in the buffer
                if pageStories:
                    self.stories.append(pageStories)
                    # Advance the page index so the next call reads the next page
                    self.pageIndex += 1

    # Print one joke each time the user presses Enter
    def getOneStory(self, pageStories, page):
        # Walk through the jokes on one page
        for story in pageStories:
            # Wait for user input
            user_input = raw_input()
            # On every Enter, check whether a new page should be loaded
            self.loadPage()
            # Entering Q ends the program
            if user_input == "Q":
                self.enable = False
                return
            print u"Page %d\tPublisher: %s\tComments: %s\tLikes: %s\n%s" % (
                page, story[0], story[3], story[2], story[1])

    # Entry point
    def start(self):
        print u"Reading Qiushibaike. Press Enter to view a new joke, Q to quit."
        # Set the flag so the main loop runs
        self.enable = True
        # Load one page of content first
        self.loadPage()
        # Local counter: which page is currently being read
        nowPage = 0
        while self.enable:
            if len(self.stories) > 0:
                # Take one page of jokes from the buffer
                pageStories = self.stories[0]
                # Increment the number of pages read
                nowPage += 1
                # Remove the page from the buffer now that it has been taken
                del self.stories[0]
                # Print the jokes on this page
                self.getOneStory(pageStories, nowPage)


spider = QSBK()
spider.start()

[Figure: output of running Full Code 2]
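The page buffering in Full Code 2 (keep a short queue of fetched pages, top it up when it runs low, and hand out the oldest page first) can be exercised offline. The sketch below is written in Python 3 and is not the original class: `StoryBuffer`, `refill`, `next_page` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real page-fetching method by returning made-up stories. Unlike the listing, it refills once per page handed out rather than once per key press.

```python
class StoryBuffer:
    """Keeps a small queue of fetched pages, refilling when it runs low."""

    def __init__(self, fetch_page, low_water=2):
        # fetch_page: callable taking a page index and returning a list of
        # stories, or None when the page cannot be loaded
        self.fetch_page = fetch_page
        self.low_water = low_water
        self.page_index = 1
        self.pages = []

    def refill(self):
        """Fetch one more page if fewer than low_water pages are buffered."""
        if len(self.pages) < self.low_water:
            stories = self.fetch_page(self.page_index)
            if stories:
                self.pages.append(stories)
                # Only advance after a successful fetch
                self.page_index += 1

    def next_page(self):
        """Pop the oldest buffered page, or return None if nothing is left."""
        self.refill()
        return self.pages.pop(0) if self.pages else None


def fake_fetch(page_index):
    """Made-up data source: two stories per page, nothing after page 3."""
    if page_index > 3:
        return None
    return [["user%d" % page_index, "joke %d-%d" % (page_index, n), "0", "0"]
            for n in (1, 2)]


if __name__ == "__main__":
    buf = StoryBuffer(fake_fetch)
    page = buf.next_page()
    while page is not None:
        for story in page:
            print(story[1])
        page = buf.next_page()
```

Passing the fetch function in as an argument keeps the queue logic separate from the network code, so it can be tested with a stand-in like `fake_fetch`.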

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
