Python Crawler (Part 1): Crawling Jokes from Qiushibaike


Reference: http://cuiqingcai.com/990.html

1. Non-object-oriented mode

Full Code 1:

# -*- coding: utf-8 -*-
import re
import urllib2

page = 1
url = 'http://www.qiushibaike.com/hot/page/' + str(page)
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
try:
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    content = response.read().decode('utf-8')
    # Capture groups: publisher, content, number of likes, number of comments
    pattern = re.compile('<div.*?class="author.*?>.*?<h2>(.*?)</h2>.*?'
                         '<div.*?class="content">(.*?)</div>.*?'
                         '<span.*?class="stats-vote">.*?<i.*?class="number">(.*?)</i>.*?'
                         '<span.*?class="dash">.*?<i.*?class="number">(.*?)</i>', re.S)
    items = re.findall(pattern, content)
    for item in items:
        print item[0], item[1], item[2], item[3]
except urllib2.URLError, e:
    if hasattr(e, "code"):
        print e.code
    if hasattr(e, "reason"):
        print e.reason
Running this script prints the publisher, the content, the number of likes, and the number of comments for every joke on the page.

Note 1: Qiushibaike does not require a login, so there is no need to handle cookies.
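If a site did require a login, one common approach in Python 2 is to route all requests through an opener that carries a cookie jar. This is only a minimal sketch, not part of the original tutorial; the URL is simply the page fetched above:

# -*- coding: utf-8 -*-
# Sketch: cookie handling with cookielib, only needed for sites that require a login.
import cookielib
import urllib2

# The CookieJar stores any cookies the server sets.
cookie_jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))

# Every request made through this opener sends and updates the stored cookies.
response = opener.open('http://www.qiushibaike.com/hot/page/1')
print response.getcode()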

2. Object-oriented mode

The code above is the core of the crawler. On top of it, here is what we want to build:

Press Enter to read one joke at a time, showing the publisher, the content, the number of likes, and the number of comments.

In addition, we want an object-oriented design: introduce a class with methods, and encapsulate and tidy up the code.
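As a rough outline only, using the same method names as the full listing below, the class looks like this:

# Outline of the QSBK class defined in Full Code 2 below.
class QSBK:
    def __init__(self):                         # set up headers and the story buffer
        pass
    def getPage(self, pageIndex):               # download the HTML of one page
        pass
    def getPageItems(self, pageIndex):          # parse one page into a list of jokes
        pass
    def loadPage(self):                         # top up the buffer when it runs low
        pass
    def getOneStory(self, pageStories, page):   # print one joke per Enter key press
        pass
    def start(self):                            # main loop
        pass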

Full Code 2:

# -*- coding: utf-8 -*-
import re
import urllib2

# Qiushibaike crawler
class QSBK:

    # Initialization: define some variables
    def __init__(self):
        self.pageIndex = 1
        self.user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        # Initialize headers
        self.headers = {'User-Agent': self.user_agent}
        # List that stores the jokes; each element holds the jokes of one page
        self.stories = []
        # Flag that keeps the program running
        self.enable = False

    # Get the source code of the page with the given index
    def getPage(self, pageIndex):
        try:
            url = 'http://www.qiushibaike.com/hot/page/' + str(pageIndex)
            # Build the request
            request = urllib2.Request(url, headers=self.headers)
            # Fetch the page with urlopen
            response = urllib2.urlopen(request)
            # Decode the page as UTF-8
            pageCode = response.read().decode('utf-8')
            return pageCode
        except urllib2.URLError, e:
            if hasattr(e, "reason"):
                print u"Failed to connect to Qiushibaike. Reason:", e.reason
            return None

    # Parse the page with the given index and return its list of jokes
    def getPageItems(self, pageIndex):
        pageCode = self.getPage(pageIndex)
        if not pageCode:
            print "Page load failed..."
            return None
        # Capture groups: publisher, content, number of likes, number of comments
        pattern = re.compile('<div.*?class="author.*?>.*?<h2>(.*?)</h2>.*?'
                             '<div.*?class="content">(.*?)</div>.*?'
                             '<span.*?class="stats-vote">.*?<i.*?class="number">(.*?)</i>.*?'
                             '<span.*?class="dash">.*?<i.*?class="number">(.*?)</i>', re.S)
        items = re.findall(pattern, pageCode)
        # List that stores the jokes of this page
        pageStories = []
        # Walk through the regular-expression matches
        for item in items:
            replaceBR = re.compile('<br/>')
            text = re.sub(replaceBR, "\n", item[1])
            # item[0] is the publisher, item[1] the content,
            # item[2] the number of likes, item[3] the number of comments
            pageStories.append([item[0].strip(), text.strip(), item[2].strip(), item[3].strip()])
        return pageStories

    # Load a page, extract its jokes and append them to the list
    def loadPage(self):
        # Load a new page if fewer than two unread pages are cached
        if self.enable == True:
            if len(self.stories) < 2:
                # Fetch a new page
                pageStories = self.getPageItems(self.pageIndex)
                # Store the page's jokes in the global list
                if pageStories:
                    self.stories.append(pageStories)
                    # Move the page index forward so the next call reads the next page
                    self.pageIndex += 1

    # Called on every Enter key press; prints one joke
    def getOneStory(self, pageStories, page):
        # Walk through the jokes of one page
        for story in pageStories:
            # Wait for user input
            input = raw_input()
            # On every Enter key press, check whether a new page needs to be loaded
            self.loadPage()
            # If the user enters Q, the program ends
            if input == "Q":
                self.enable = False
                return
            print u"Page %d\tPublisher: %s\tComments: %s\tLikes: %s\n%s" % (page, story[0], story[3], story[2], story[1])

    # Start method
    def start(self):
        print u"Reading Qiushibaike. Press Enter to view a new joke, or enter Q to quit."
        # Set the flag so the main loop runs
        self.enable = True
        # Load the first page
        self.loadPage()
        # Local variable that tracks which page is currently being read
        nowPage = 0
        while self.enable:
            if len(self.stories) > 0:
                # Take one page of jokes from the global list
                pageStories = self.stories[0]
                # Increase the count of pages read
                nowPage += 1
                # Remove the first element of the global list because it has been taken out
                del self.stories[0]
                # Print this page's jokes
                self.getOneStory(pageStories, nowPage)

spider = QSBK()
spider.start()

Running the program, each press of the Enter key prints one joke with its page number, publisher, number of comments, and number of likes; entering Q quits.