Crawler basics and regular expressions: http://blog.csdn.net/gzh0222/article/details/12647723
Crawlers in practice and advanced topics: http://www.cnblogs.com/xin-xin/p/4297852.html
Other online resources: http://www.crifan.com/files/doc/docbook/python_topic_web_scrape/release/html/python_topic_web_scrape.html
http://www.crifan.com/files/doc/docbook/web_scrape_emulate_login/release/html/web_scrape_emulate_login.html
Python and databases: http://www.cnblogs.com/fnng/p/3565912.html
Below is the Python source code for a crawler that fetches jokes from Qiushibaike (the "Encyclopedia of Embarrassing Things").
Software: Python 2.5
System: Win7
# -*- coding: utf-8 -*-

import urllib2
import urllib
import re
import thread
import time


# ----------- Load and process Qiushibaike -----------
class Spider_Model:

    def __init__(self):
        self.page = 1
        self.count = 1
        self.pages = []
        self.enable = False

    # Extract all the jokes on a page, add them to a list, and return the list
    def GetPage(self, page):
        myUrl = "http://m.qiushibaike.com/hot/page/" + page
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        req = urllib2.Request(myUrl, headers=headers)
        myResponse = urllib2.urlopen(req)
        myPage = myResponse.read()
        # encode converts a unicode object into a string in another encoding
        # decode converts a string in another encoding into a unicode object
        unicodePage = myPage.decode("utf-8")

        # Find all div tags with class="content"
        # re.S makes "." match any character, including line breaks
        myItems = re.findall('<div class="content">.*?</div>', unicodePage, re.S)
        items = []
        for item in myItems:
            # Strip the HTML tags from the joke
            strinfo = re.compile(u'<.*?>')
            tt = strinfo.sub(u'', item)

            # strinfo1 = re.compile(u'^\n*')
            # tt = strinfo1.sub(u'', tt)

            # strinfo2 = re.compile(u'\n*$')
            # tt = strinfo2.sub(u'', tt)
            tt = tt.replace(u'\n', u'')

            items.append(tt)
        return items

    # Load new jokes in the background
    def LoadPage(self):
        # Keep running until the user enters q
        while self.enable:
            # If fewer than 2 pages are buffered in self.pages
            if len(self.pages) < 2:
                try:
                    # Fetch the jokes on the next page
                    myPage = self.GetPage(str(self.page))
                    self.page += 1
                    self.pages.append(myPage)
                except:
                    print 'Unable to connect to Qiushibaike!'
            else:
                time.sleep(1)

    def ShowPage(self, nowPage, page):
        for items in nowPage:
            print u'Item %d\n' % self.count, items
            self.count += 1
            myInput = raw_input()
            if myInput == "q":
                self.enable = False
                break

    def Start(self):
        self.enable = True
        page = self.page

        print u'Loading, please wait......\n'

        # Start a new thread that loads jokes in the background and stores them
        thread.start_new_thread(self.LoadPage, ())

        # ----------- Display the loaded jokes -----------
        while self.enable:
            # If self.pages contains at least one page
            if self.pages:
                nowPage = self.pages[0]
                del self.pages[0]
                self.ShowPage(nowPage, page)
                page += 1


# ----------- Program entry point -----------
print u"""
---------------------------------------
   Program:  Qiushibaike crawler
   Version:  1.0
   Author:   ZZ
   Date:     2016-02-16
   Language: Python 2.5
   Usage:    enter 'q' to quit reading Qiushibaike
   Function: press Enter to browse today's hot Qiushibaike posts
---------------------------------------
"""

print u"Please press Enter to view today's Qiushibaike content:"
raw_input(' ')
myModel = Spider_Model()
myModel.Start()
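The extraction step in GetPage can be tried on its own, without a network connection. What follows is only a minimal sketch, written for Python 3 rather than the Python 2.5 used by the crawler. The function name extract_jokes and the sample HTML are made up for illustration, and the markup of the live site may not match the pattern used here.

```python
import re

# Hypothetical sample of the markup the crawler expects; the real page
# may be structured differently.
SAMPLE_HTML = (
    b'<div class="content">\n\nFirst joke, <b>bold</b> part\n\n</div>'
    b'<div class="author">someone</div>'
    b'<div class="content">\nSecond joke\n</div>'
)

# re.S lets "." match line breaks, so a joke spanning several lines is
# still captured as one block; ".*?" is non-greedy and stops at the
# first closing </div>.
CONTENT_RE = re.compile(r'<div class="content">.*?</div>', re.S)
TAG_RE = re.compile(r'<.*?>')


def extract_jokes(raw_bytes):
    """Return the text of every <div class="content"> block."""
    # decode() turns the UTF-8 bytes sent by the server into text
    page = raw_bytes.decode("utf-8")
    jokes = []
    for block in CONTENT_RE.findall(page):
        text = TAG_RE.sub('', block)   # strip the HTML tags
        text = text.replace('\n', '')  # strip line breaks
        jokes.append(text)
    return jokes


if __name__ == "__main__":
    for number, joke in enumerate(extract_jokes(SAMPLE_HTML), 1):
        print("Item %d: %s" % (number, joke))
```

Keeping the parsing in a function that takes bytes and returns a list makes it easy to check the regular expressions against a saved copy of a page before pointing the crawler at the live site.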
The results are as follows:
Python crawler with MySQL