Python Crawler Pragmatic Series III

Last updated: 2015-03-28 Source: Internet

Uploaded by: User


Tags: Python crawler, urllib, regex, Excel read/write

Overview:

In the previous post, we learned how to extract the relevant information for a single company on Ganji (ganji.com) and save it to Excel.

Goal for this post:

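In this post, we batch-download the pages of every company listed on the Ganji front page (not every company page on Ganji; that can be left for a later task), then process each company's information in bulk and save it to Excel.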

Note:

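In the previous post, the code only matched the information on one particular company page. Unfortunately, the number of entries in the "contact the owner" module is not fixed across companies: some pages have 10 li elements, others have 9 or 12, and not every company provides a QQ contact or a company name. So I adjusted the code slightly so that it handles most pages.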
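One way to cope with a varying number of li elements is to search their combined text with regular expressions and accept that a field may be absent. The following is a minimal sketch written for Python 3, not the code used in this series: the helper names `first_match` and `extract_contact` are hypothetical, and the two patterns (an 11-digit mobile number and a 5 to 10 digit QQ number) are assumptions about what the pages contain.

```python
import re

# Assumed patterns: an 11-digit mobile number and a 5-10 digit QQ number.
PHONE_PATTERN = re.compile(r'1[3578][0-9]{9}')
QQ_PATTERN = re.compile(r'\d{5,10}')


def first_match(pattern, text):
    """Return the first match of pattern in text, or None when nothing matches."""
    match = pattern.search(text)
    return match.group() if match else None


def extract_contact(li_texts):
    """Pull the phone and QQ numbers out of a variable-length list of li texts."""
    joined = ' '.join(li_texts)
    phone = first_match(PHONE_PATTERN, joined)
    # Drop the phone number first so its digits are not mistaken for a QQ number.
    remainder = joined.replace(phone, ' ') if phone else joined
    qq = first_match(QQ_PATTERN, remainder)
    return {'phone': phone, 'qq': qq}


# A page with both fields, and a shorter page without a QQ number.
print(extract_contact(['Contact: Mr. Wang', 'QQ: 123456', 'Tel: 13812345678']))
print(extract_contact(['Contact: Ms. Li', 'Tel: 15900001111']))
```

Returning None for a missing field leaves it to the caller to decide whether to write an empty cell or skip the record.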

Batch-downloading the pages:

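We extract the company links contained in the Ganji front page and download them all. This time we use a download class to encapsulate the methods we need. Given the URL of the Ganji front page, we first download that page, then parse out and store the company links it contains, and finally download each of those links into the pagestroage folder.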

Code:

    # -*- coding: utf-8 -*-
    # Python 2 code: urlretrieve is imported directly from urllib.
    import re
    import os
    from urllib import urlretrieve
    from bs4 import BeautifulSoup


    class Download(object):
        'Downloads the given URLs and saves each one to a file.'

        def __init__(self):
            # Sequence number of the next company page to save
            self.index = 0
            # Create the storage folder if it is missing
            if not os.path.exists('pagestroage'):
                os.makedirs('pagestroage')

        def getPages(self, url, isMain=False):
            'Download every URL in the list; the Ganji front page is saved separately.'
            for u in url:
                try:
                    revtal = urlretrieve(u)[0]
                except IOError:
                    revtal = None
                if revtal is None:
                    print('Open url error')
                else:
                    self.savePages(revtal, isMain)

        def savePages(self, webpage, isMain=False):
            'Save the downloaded page.'
            f = open(webpage)
            lines = f.readlines()
            f.close()
            if isMain:
                # The Ganji front page is saved as main.txt
                fobj = open(os.path.join('pagestroage', 'main.txt'), 'w')
            else:
                # Company pages are saved in order as file<i>.txt, where i is the sequence number
                fobj = open(os.path.join('pagestroage', 'file%s.txt' % self.index), 'w')
                self.index += 1
            fobj.writelines(lines)
            fobj.close()

        def getAllComUrls(self):
            'Parse the Ganji front page and return the links of all companies on it.'
            mainPath = os.path.join('pagestroage', 'main.txt')
            if not os.path.exists(mainPath):  # check that the file exists
                print('The file is missing')
                return None
            fobj = open(mainPath, 'r')
            lines = fobj.readlines()
            fobj.close()
            soup = BeautifulSoup(''.join(lines))
            leftBox = soup.find(attrs={'class': 'leftBox'})
            list_ = leftBox.find(attrs={'class': 'list'})
            ul = list_.find('ul')
            li = ul.find_all('li')
            href_regex = r'href="(.*?)"'
            urls = []
            for l in li:
                urls.append('http://bj.ganji.com' + re.search(href_regex, str(l)).group(1))
            return urls


    if __name__ == '__main__':
        # Start from the Ganji front page
        url = ['http://bj.ganji.com/danbaobaoxian/o1/']
        # Instantiate the download class
        download = Download()
        # Download the Ganji front page first
        download.getPages(url, True)
        # Parse the downloaded front page and extract the URLs of all companies
        urls = download.getAllComUrls()
        # Download the extracted URLs
        if urls:
            download.getPages(urls)
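The link-extraction step can be tried without touching the network. The sketch below is written for Python 3 and uses only the standard library. It applies the same href regular expression to a small hand-written HTML fragment and resolves each relative link against the site root. The function name `extract_company_urls` and the sample paths are hypothetical, and unlike the listing above it collects every href in the fragment rather than the first one in each li.

```python
import re
from urllib.parse import urljoin

BASE_URL = 'http://bj.ganji.com'
HREF_PATTERN = re.compile(r'href="(.*?)"')


def extract_company_urls(html, base_url=BASE_URL):
    """Return absolute URLs for every href in html, in order, without duplicates."""
    urls = []
    for href in HREF_PATTERN.findall(html):
        absolute = urljoin(base_url, href)
        if absolute not in urls:
            urls.append(absolute)
    return urls


# Hand-written sample; the paths are made up for illustration.
sample = '''
<ul>
  <li><a href="/danbaobaoxian/1001x.htm">Company A</a></li>
  <li><a href="/danbaobaoxian/1002x.htm">Company B</a></li>
  <li><a href="/danbaobaoxian/1001x.htm">Company A again</a></li>
</ul>
'''
for url in extract_company_urls(sample):
    print(url)
```

Using `urljoin` instead of string concatenation also copes with links that are already absolute.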

Result:

We get a dozen or so text files, each containing one company page.
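The numbering scheme behind those files can be sketched on its own. This is a Python 3 illustration under stated assumptions, not the code from this post: `save_pages` is a hypothetical helper, it takes page contents that are already in memory, and it writes into a temporary folder so that it can be run safely.

```python
import os
import tempfile


def save_pages(pages, folder):
    """Write each page to folder/file<i>.txt and return the paths written."""
    os.makedirs(folder, exist_ok=True)  # create the folder if it is missing
    paths = []
    for index, html in enumerate(pages):
        path = os.path.join(folder, 'file%d.txt' % index)
        with open(path, 'w', encoding='utf-8') as fobj:
            fobj.write(html)
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'pagestroage')
    written = save_pages(['<html>A</html>', '<html>B</html>'], target)
    print([os.path.basename(p) for p in written])  # ['file0.txt', 'file1.txt']
```

Numbering from zero means a later step can loop over `range(count)` to reopen every file by name.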

Analyzing the pages:

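After the steps above, we have the HTML text of every company on the Ganji front page. Next we use the Analysiser class to process this data. Note that almost all of the methods in Analysiser were introduced in the previous posts; here they are simply wrapped in a class and adapted for batch processing.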

Code:

    # -*- coding: utf-8 -*-
    # Python 2 code: relies on reload(sys) and sys.setdefaultencoding.
    import re
    from bs4 import BeautifulSoup
    import xlwt
    import os
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')


    class Analysiser(object):
        'Analyzes the downloaded company pages and saves the information to an Excel sheet.'

        def __init__(self):
            'Initialize an Excel workbook.'
            self.wb = xlwt.Workbook()
            self.ws = self.wb.add_sheet('CompanyInfoSheet')
            self.initExcel()

        def initExcel(self):
            'Write the header row, using a distinct font for it.'
            # Create the style
            style = xlwt.XFStyle()
            # Create a font for the style
            font = xlwt.Font()
            font.name = 'Times New Roman'
            font.bold = True
            # Attach the font to the style
            style.font = font
            # Write the header cells with that style
            headers = [u'Company name', u'Service features', u'Service area',
                       u'Contact person', u'Business address', u'QQ',
                       u'Phone', u'Company website']
            for col, title in enumerate(headers):
                self.ws.write(0, col, title, style)
            self.wb.save('xinrui.xls')

        def analysAllFiles(self):
            'Batch-analyze the page sources and extract the information of each company.'
            # List every file in pagestroage (the folder holding the downloaded company pages)
            filenames = os.listdir('pagestroage')
            # Number of company pages saved (excluding main.txt, the Ganji front page)
            counts = len(filenames) - 1
            for i in range(counts):
                # The contact module on these two pages differs from the others;
                # matching them would need separate code, so they are skipped
                if i == 12 or i == 7:
                    continue
                # Open the file, read it into lines, close the file object
                f = open(os.path.join('pagestroage', 'file%s.txt' % i), 'r')
                lines = f.readlines()
                f.close()
                # Build a BeautifulSoup parse tree and walk it in this order:
                # soup --> body --> (div with id wrapper) --> (div with class d-left-box)
                # --> (div with id dzcontactus) --> (div with class con) --> ul --> (each li under ul)
                soup = BeautifulSoup(''.join(lines))
                wrapper = soup.find(id="wrapper")
                clearfix = wrapper.find_all(attrs={'class': 'd-left-box'})[0]
                dzcontactus = clearfix.find(id="dzcontactus")
                con = dzcontactus.find(attrs={'class': 'con'})
                ul = con.find('ul')
                li = ul.find_all('li')
                # All information for one company; a dict allows access by key (a list would also work)
                record = {}
                # Company name
                companyName = li[1].find('h1').contents[0]
                record['companyName'] = companyName
                # Service features
                serviceFeature = li[2].find('p').contents[0]
                record['serviceFeature'] = serviceFeature
                # Services offered
                serviceProvider = []
                serviceProviderResultSet = li[3].find_all('a')
                for service in serviceProviderResultSet:
                    serviceProvider.append(service.contents[0])
                record['serviceProvider'] = serviceProvider
                # Service area
                serviceScope = []
                serviceScopeResultSet = li[4].find_all('a')
                for scope in serviceScopeResultSet:
                    serviceScope.append(scope.contents[0])
                record['serviceScope'] = serviceScope
                # Contact person
                contacts = li[5].find('p').contents[0]
                contacts = str(contacts).strip().encode("utf-8")
                record['contacts'] = contacts
                # Business address
                addressResultSet = li[6].find('p')
                re_h = re.compile(r'</?\w+[^>]*>')  # HTML tags
                address = re_h.sub('', str(addressResultSet))
                record['address'] = address.encode("utf-8")
                # Join the remaining li elements, whose number varies between pages
                restli = ''
                for l in range(8, len(li) - 1):
                    restli += str(li[l])
                # QQ number (not every company lists one)
                qqMatch = re.search(r'\d{5,10}', restli)
                record['qqNum'] = qqMatch.group() if qqMatch else ''
                # Phone number
                phoneMatch = re.search(r'1[3578][0-9]{9}', restli)
                record['phoneNum'] = phoneMatch.group() if phoneMatch else ''
                # Company website
                companySite = li[len(li) - 1].find('a').contents[0]
                record['companySite'] = companySite
                self.writeToExcel(record, i + 1)

        def writeToExcel(self, record, index):
            'Write every value of the given record dict into row index of the Excel sheet.'
            # Company name
            companyName = record['companyName']
            self.ws.write(index, 0, companyName)
            # Service features
            serviceFeature = record['serviceFeature']
            self.ws.write(index, 1, serviceFeature)
            # Service area
            serviceScope = ','.join(record['serviceScope'])
            self.ws.write(index, 2, serviceScope)
            # Contact person
            contacts = record['contacts']
            self.ws.write(index, 3, contacts.decode("utf-8"))
            # Business address
            address = record['address']
            self.ws.write(index, 4, address.decode("utf-8"))
            # QQ number
            qqNum = record['qqNum']
            self.ws.write(index, 5, qqNum)
            # Phone number
            phoneNum = record['phoneNum']
            phoneNum = str(phoneNum).encode("utf-8")
            self.ws.write(index, 6, phoneNum.decode("utf-8"))
            # Company website
            companySite = record['companySite']
            self.ws.write(index, 7, companySite)
            self.wb.save('xinrui.xls')


    if __name__ == '__main__':
        ana = Analysiser()
        ana.analysAllFiles()
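Two of the regular expressions in this step are easy to check in isolation. The sketch below, written for Python 3, strips HTML tags from a fragment using the same tag pattern as the listing, and then shows why a character class for the second digit of a phone number should not contain `|`: inside square brackets `|` is a literal character, not an alternation. The sample strings and the names `strip_tags`, `PHONE_LOOSE` and `PHONE_STRICT` are made up for illustration.

```python
import re

TAG_PATTERN = re.compile(r'</?\w+[^>]*>')


def strip_tags(fragment):
    """Remove HTML tags from fragment and trim surrounding whitespace."""
    return TAG_PATTERN.sub('', fragment).strip()


print(strip_tags('<p>No. 1 Example Road, <b>Beijing</b></p>'))
# No. 1 Example Road, Beijing

# With '|' inside the brackets, the class also accepts a literal '|'.
PHONE_LOOSE = re.compile(r'1[3|5|7|8|][0-9]{9}')
PHONE_STRICT = re.compile(r'1[3578][0-9]{9}')

bogus = '1|123456789'
print(PHONE_LOOSE.search(bogus) is not None)   # True: '|' is accepted
print(PHONE_STRICT.search(bogus) is not None)  # False
```

Both patterns accept a well-formed number such as 13812345678; they differ only on input that contains a stray `|`.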

Result:

We get an Excel file containing the information for every company listed on the Ganji front page.

Afterthoughts:

Doesn't this Excel file look cool? We can finally do something practical.

Still, we can do better, and smarter.

To be continued.
