Python Crawlers and MySQL

Last updated: 2016-02-17. Source: the Internet

Uploaded by: User


Crawler basics and regular expressions: http://blog.csdn.net/gzh0222/article/details/12647723

Hands-on and advanced crawling: http://www.cnblogs.com/xin-xin/p/4297852.html

Other online material: http://www.crifan.com/files/doc/docbook/python_topic_web_scrape/release/html/python_topic_web_scrape.html

                       http://www.crifan.com/files/doc/docbook/web_scrape_emulate_login/release/html/web_scrape_emulate_login.html

Python and databases: http://www.cnblogs.com/fnng/p/3565912.html

Below is the Python source for a crawler that scrapes jokes from Qiushibaike (糗事百科).

Software: Python 2.5

System: Windows 7

# -*- coding: utf-8 -*-

import urllib2
import urllib
import re
import thread
import time


#----------- Load and process Qiushibaike -----------
class Spider_Model:

    def __init__(self):
        self.page = 1        # next page number to fetch
        self.count = 1       # running number of the joke being shown
        self.pages = []      # buffer of pages already fetched
        self.enable = False  # set to False to stop both loops

    # Extract every joke on a page, add it to a list, and return the list
    def GetPage(self, page):
        myUrl = "http://m.qiushibaike.com/hot/page/" + page
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        req = urllib2.Request(myUrl, headers=headers)
        myResponse = urllib2.urlopen(req)
        myPage = myResponse.read()
        # encode converts a unicode string into a byte string in another encoding
        # decode converts a byte string in another encoding into unicode
        unicodePage = myPage.decode("utf-8")

        # Find every div tag with class="content"
        # re.S makes . match any character, including newlines
        myItems = re.findall('<div class="content">.*?</div>', unicodePage, re.S)
        items = []
        for item in myItems:
            # Strip the HTML markup from the joke
            strinfo = re.compile(u'<.*?>')
            tt = strinfo.sub(u'', item)

            #strinfo1 = re.compile(u'^\n*')
            #tt = strinfo1.sub(u'', tt)

            #strinfo2 = re.compile(u'\n*$')
            #tt = strinfo2.sub(u'', tt)
            tt = tt.replace(u'\n', u'')

            items.append(tt)
        return items

    # Load new jokes in the background
    def LoadPage(self):
        # Keep running until the user enters q
        while self.enable:
            # If fewer than 2 pages are buffered in self.pages
            if len(self.pages) < 2:
                try:
                    # Fetch the jokes on the next page
                    myPage = self.GetPage(str(self.page))
                    self.page += 1
                    self.pages.append(myPage)
                except Exception:
                    print u'Unable to connect to Qiushibaike!'
                    # Wait before retrying instead of looping at full speed
                    time.sleep(1)
            else:
                time.sleep(1)

    def ShowPage(self, nowPage, page):
        for items in nowPage:
            print u'No. %d\n' % self.count, items
            self.count += 1
            myInput = raw_input()
            if myInput == "q":
                self.enable = False
                break

    def Start(self):
        self.enable = True
        page = self.page

        print u'......Searching......\n'

        # Start a background thread that loads and stores the jokes
        thread.start_new_thread(self.LoadPage, ())

        #----------- Load and process Qiushibaike -----------
        while self.enable:
            # If self.pages holds at least one page
            if self.pages:
                nowPage = self.pages[0]
                del self.pages[0]
                self.ShowPage(nowPage, page)
                page += 1


#----------- Program entry point -----------
print u"""
---------------------------------------
   Program:  Qiushibaike crawler
   Version:  1.0
   zz
   Date:     2016-02-16
   Language: Python 2.5
   Usage:    enter 'q' to quit reading Qiushibaike
   Function: press Enter to browse today's hot Qiushibaike posts one by one
---------------------------------------
"""


print u"Press Enter to browse today's Qiushibaike content:"
raw_input(' ')
myModel = Spider_Model()
myModel.Start()
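The extraction step inside GetPage can be tried without any network access. The following is a minimal sketch, written for Python 3 rather than the Python 2.5 used in the listing, and run against a made-up HTML string: `SAMPLE_HTML`, `extract_items`, `CONTENT_RE`, and `TAG_RE` are hypothetical names introduced only for illustration, and the real page markup may differ from this sample.

```python
import re

# Hypothetical sample markup (an assumption); the real page structure may differ.
SAMPLE_HTML = (
    '<div class="content">\nFirst <b>joke</b>\n</div>'
    '<div class="author">someone</div>'
    '<div class="content">\nSecond <i>joke</i>\n</div>'
)

# re.S lets "." match newlines, so a block may span several lines.
CONTENT_RE = re.compile(r'<div class="content">.*?</div>', re.S)
# Matches a single tag such as <div ...>, </div>, or <b>.
TAG_RE = re.compile(r'<.*?>')


def extract_items(html):
    """Return the text of each <div class="content"> block, without tags or newlines."""
    items = []
    for block in CONTENT_RE.findall(html):
        text = TAG_RE.sub('', block)          # drop the markup
        items.append(text.replace('\n', ''))  # drop the line breaks
    return items


if __name__ == "__main__":
    for number, item in enumerate(extract_items(SAMPLE_HTML), 1):
        print(number, item)
```

Because the pattern is non-greedy, each match ends at the first `</div>` it meets, so a content block that contains a nested div would be cut short. An HTML parser would likely be more robust for such pages.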


The program's output when run:

[Figure: screenshot of the crawler's output; image not preserved]
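In the listing, LoadPage runs in a background thread and keeps up to two pages buffered in self.pages, while the main loop polls that list and shows one page at a time. One possible way to express the same producer/consumer idea is with the Python 3 `queue` and `threading` modules (the queue module is named `Queue` in Python 2). This is only a sketch under stated assumptions: `fake_get_page`, `load_pages`, and `read_all` are hypothetical names, and the fetcher returns made-up items instead of contacting the site.

```python
import queue
import threading


def fake_get_page(page):
    """Stand-in for GetPage (an assumption): returns made-up items, no network."""
    return ["page %d, item %d" % (page, i) for i in (1, 2)]


def load_pages(fetch, first_page, last_page, buffer):
    """Producer: fetch pages in order and put each one into the bounded buffer."""
    for page in range(first_page, last_page + 1):
        buffer.put(fetch(page))  # blocks while the buffer is full
    buffer.put(None)             # sentinel: no more pages


def read_all(fetch, first_page, last_page):
    """Consumer: start the loader thread and collect items as pages arrive."""
    buffer = queue.Queue(maxsize=2)  # hold at most two pages, as in the listing
    worker = threading.Thread(
        target=load_pages, args=(fetch, first_page, last_page, buffer))
    worker.daemon = True
    worker.start()
    shown = []
    while True:
        page_items = buffer.get()    # blocks until a page is ready
        if page_items is None:
            break
        shown.extend(page_items)
    worker.join()
    return shown


if __name__ == "__main__":
    for line in read_all(fake_get_page, 1, 3):
        print(line)
```

A bounded queue blocks the producer when two pages are waiting and blocks the consumer when none are, so neither side needs to sleep or poll in a loop.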

