Reference: http://cuiqingcai.com/990.html
1. Non-object-oriented approach
Full code 1 (Python 2):
# -*- coding: utf-8 -*-
import re
import urllib2

page = 1
url = 'http://www.qiushibaike.com/hot/page/' + str(page)
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
headers = {'User-Agent': user_agent}
try:
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    content = response.read().decode('utf-8')
    # NOTE: the middle of this pattern is garbled in the source. The author and
    # content groups are an assumed reconstruction and may not match the live page.
    # Groups: author, content, likes, comments.
    pattern = re.compile('<div.*?class="author.*?<h2>(.*?)</h2>.*?'
                         '<div.*?class="content">(.*?)</div>.*?'
                         '<span.*?class="stats-vote">.*?<i.*?class="number">(.*?)</i>.*?'
                         '<span.*?class="dash">.*?<i.*?class="number">(.*?)</i>', re.S)
    items = re.findall(pattern, content)
    for item in items:
        print item[0], item[1], item[2], item[3]
except urllib2.URLError, e:
    # An HTTPError carries a status code; every URLError carries a reason
    if hasattr(e, "code"):
        print e.code
    if hasattr(e, "reason"):
        print e.reason
Running the script prints the publisher, content, like count, and comment count of each joke on the page.
Note 1: Qiushibaike does not require a login, so there is no need to handle cookies.
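Because nothing on the page sits behind a login, a request only needs a User-Agent header. The following is a minimal Python 3 sketch of that request (the listing above targets Python 2, where urllib2 exists). The helper names build_request and fetch are hypothetical, and the sketch does not assume that the site still serves this URL.

```python
import urllib.error
import urllib.request

BASE_URL = 'http://www.qiushibaike.com/hot/page/'
USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'


def build_request(page):
    """Build a GET request for one page of hot jokes; no cookies are attached."""
    return urllib.request.Request(BASE_URL + str(page),
                                  headers={'User-Agent': USER_AGENT})


def fetch(page, timeout=10):
    """Return the decoded page source, or None if the request fails."""
    try:
        with urllib.request.urlopen(build_request(page), timeout=timeout) as response:
            return response.read().decode('utf-8')
    except urllib.error.URLError as e:
        # HTTPError (a URLError subclass) also carries a status code
        if hasattr(e, 'code'):
            print(e.code)
        if hasattr(e, 'reason'):
            print(e.reason)
        return None
```

Calling fetch(1) would attempt the download. Building the request is kept separate so the URL and headers can be inspected without touching the network.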
2. Object-oriented approach
The code above is the core of the crawler. What we want to achieve next:
Each press of Enter shows one joke, with its publisher, content, number of likes, and number of comments.
In addition, we restructure the program in an object-oriented style, introducing a class and methods to encapsulate and tidy up the code.
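One way to picture this encapsulation is a small class that owns both the parsing step and the buffer of unread jokes. This is a Python 3 sketch, not the author's class: JokeReader, its methods, the injected fetch_page callable, and the simplified markup in SAMPLE_HTML are all assumptions made for illustration. It also keeps a flat queue of jokes rather than a list of pages.

```python
import re

# Hypothetical, simplified markup used only to exercise the sketch.
SAMPLE_HTML = ('<div class="author">alice</div>'
               '<div class="content">line one<br/>line two</div>'
               '<i class="number">12</i><i class="number">3</i>')


class JokeReader:
    """Buffers parsed pages and hands out one joke per call to next_story()."""

    # Assumed pattern for the sample markup; the real page structure may differ.
    PATTERN = re.compile(
        r'<div class="author">(.*?)</div>.*?'
        r'<div class="content">(.*?)</div>.*?'
        r'<i class="number">(.*?)</i>.*?'
        r'<i class="number">(.*?)</i>', re.S)

    def __init__(self, fetch_page):
        self.fetch_page = fetch_page  # callable: page number -> HTML text or None
        self.page_index = 1
        self.stories = []             # queue of (page, author, text, likes, comments)

    def parse(self, html):
        """Extract (author, text, likes, comments) tuples from one page."""
        items = []
        for author, text, likes, comments in self.PATTERN.findall(html):
            text = re.sub(r'<br\s*/?>', '\n', text)  # turn <br/> into newlines
            items.append((author.strip(), text.strip(),
                          likes.strip(), comments.strip()))
        return items

    def load_page(self):
        """Fetch and parse the next page; return False when nothing was loaded."""
        html = self.fetch_page(self.page_index)
        if not html:
            return False
        parsed = self.parse(html)
        self.stories.extend((self.page_index,) + item for item in parsed)
        self.page_index += 1
        return bool(parsed)

    def next_story(self):
        """Return the next joke, loading another page when the queue is empty."""
        if not self.stories and not self.load_page():
            return None
        return self.stories.pop(0)


# Usage with a stand-in fetcher that serves the sample for page 1 only.
reader = JokeReader(lambda page: SAMPLE_HTML if page == 1 else None)
print(reader.next_story())
```

Passing the fetcher in as a callable keeps the class testable without network access. A real downloader can be swapped in later without changing the parsing or buffering logic.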
Full code 2 (Python 2):
# -*- coding: utf-8 -*-
import re
import urllib2


# Qiushibaike crawler
class QSBK:
    # Initializer: define the instance variables
    def __init__(self):
        self.pageIndex = 1
        self.user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        # Initialize the request headers
        self.headers = {'User-Agent': self.user_agent}
        # Buffer of jokes; each element holds the jokes from one page
        self.stories = []
        # Flag that controls whether the program keeps running
        self.enable = False

    # Fetch the page source for the given page index
    def getPage(self, pageIndex):
        try:
            url = 'http://www.qiushibaike.com/hot/page/' + str(pageIndex)
            # Build the request
            request = urllib2.Request(url, headers=self.headers)
            # Fetch the page with urlopen
            response = urllib2.urlopen(request)
            # Decode the page bytes as UTF-8
            pageCode = response.read().decode('utf-8')
            return pageCode
        except urllib2.URLError, e:
            if hasattr(e, "reason"):
                print u"Failed to connect to Qiushibaike, reason:", e.reason
            return None

    # Take a page index and return the list of jokes on that page
    def getPageItems(self, pageIndex):
        pageCode = self.getPage(pageIndex)
        if not pageCode:
            print "Page load failed ...."
            return None
        # NOTE: the middle of this pattern is garbled in the source. The author and
        # content groups are an assumed reconstruction and may not match the live page.
        pattern = re.compile('<div.*?class="author.*?<h2>(.*?)</h2>.*?'
                             '<div.*?class="content">(.*?)</div>.*?'
                             '<span.*?class="stats-vote">.*?<i.*?class="number">(.*?)</i>.*?'
                             '<span.*?class="dash">.*?<i.*?class="number">(.*?)</i>', re.S)
        items = re.findall(pattern, pageCode)
        # Holds the jokes from this page
        pageStories = []
        replaceBR = re.compile('<br/>')
        # Walk through the regex matches
        for item in items:
            text = re.sub(replaceBR, "\n", item[1])
            # item[0] is the publisher, item[1] the content,
            # item[2] the like count, item[3] the comment count
            pageStories.append([item[0].strip(), text.strip(), item[2].strip(), item[3].strip()])
        return pageStories

    # Load a page, extract its jokes, and add them to the buffer
    def loadPage(self):
        # Load a new page when fewer than 2 unread pages remain
        if self.enable:
            if len(self.stories) < 2:
                # Fetch a new page
                pageStories = self.getPageItems(self.pageIndex)
                # Store the page's jokes in the buffer
                if pageStories:
                    self.stories.append(pageStories)
                    # Advance the page index so the next call reads the next page
                    self.pageIndex += 1

    # Print one joke each time the user presses Enter
    def getOneStory(self, pageStories, page):
        # Walk through the jokes on one page
        for story in pageStories:
            # Wait for user input
            user_input = raw_input()
            # On every keypress, check whether a new page should be loaded
            self.loadPage()
            # Entering Q ends the program
            if user_input == "Q":
                self.enable = False
                return
            print u"Page %d\tPublisher: %s\tComments: %s\tLikes: %s\n%s" % (page, story[0], story[3], story[2], story[1])

    # Entry point
    def start(self):
        print u"Reading Qiushibaike. Press Enter to see the next joke, Q to quit."
        # Set the flag so the main loop runs
        self.enable = True
        # Load one page of content first
        self.loadPage()
        # Local counter: the number of the page currently being read
        nowPage = 0
        while self.enable:
            if len(self.stories) > 0:
                # Take one page of jokes from the buffer
                pageStories = self.stories[0]
                # Increment the number of the page being read
                nowPage += 1
                # Remove the page from the buffer now that it has been taken out
                del self.stories[0]
                # Print the jokes from this page
                self.getOneStory(pageStories, nowPage)


spider = QSBK()
spider.start()
When run, the program prints one joke per press of Enter, labelled with its page number, publisher, comment count, and like count.