Example: scraping a car website with Python, parsing the HTML, and saving the data to Excel


Last updated: 2017-01-18. Source: Internet

Uploaded by: User


1. Address of the car website

2. Inspecting the site in Firefox shows that it does not serve its data as JSON; the information is simply embedded in plain HTML pages.
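The same check can also be done in code rather than by eye. The following is a minimal sketch, written for Python 3, that tests whether a response body parses as JSON. The helper name `looks_like_json` and both sample bodies are assumptions made for illustration; they are not taken from the site.

```python
import json


def looks_like_json(body):
    """Return True if the response body parses as JSON, False otherwise."""
    try:
        json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


# Hypothetical response bodies, not taken from the real site.
html_body = "<html><body><table><tr><td>Beijing</td></tr></table></body></html>"
json_body = '{"dealers": [{"city": "Beijing"}]}'

print(looks_like_json(html_body))
print(looks_like_json(json_body))
```

If the check fails, as it would for this site, the data has to be extracted from the HTML markup itself.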

3. Parse the HTML with PyQuery from the pyquery library.

Page layout: (screenshot not preserved)

The code is as follows:

# Python 2 code: urllib2 was replaced by urllib.request in Python 3.
import urllib2
from pyquery import PyQuery as pq

# Method of the scraper class. The class itself, Constant, self.entry_url
# and self.saver are defined elsewhere and are not shown in this article.
def get_dealer_info(self):
    """Fetch dealer information."""
    # CSS path to the table rows, copied with Firefox's "copy CSS path" feature.
    css_select = 'html body div.box div.news_wrapper div.main div.news_list div.service_main div table tr '
    # Read the page.
    page = urllib2.urlopen(self.entry_url).read()
    # Phone numbers in the page are separated by <br/>. An element's .text
    # only holds the text before its first child element (the rest is stored
    # in that child's .tail), so everything after the <br/> would be lost.
    # Replacing the tag with '&' keeps the whole value in .text.
    page = page.replace('<br />', '&')
    page = page.replace('<br/>', '&')
    # Parse the page with PyQuery.
    d = pq(page)
    # List of dealers that is handed to the saver at the end.
    dealer_list = []
    # Province of the most recent 6-column row, reused by 5-column rows.
    strp = u''
    for dealer_div in d(css_select):
        # Each match is a tr; the actual data is in its td children.
        p = dealer_div.findall('td')
        # One dict per dealer, appended to dealer_list.
        dealer = {}
        # The branches below filter rows by column count, because some rows
        # do not match the final data format. Adapt them as needed.
        if len(p) == 1:
            print('@')
        elif len(p) == 6:
            # Full row: remember the province for the rows that follow.
            strp = p[0].text.strip()
            dealer[Constant.PROVINCE] = strp
            dealer[Constant.CITY] = p[1].text.strip()
            dealer[Constant.NAME] = p[2].text.strip()
            dealer[Constant.ADDRESSTYPE] = p[3].text.strip()
            dealer[Constant.ADDRESS] = p[4].text.strip()
            dealer[Constant.TELPHONE] = p[5].text.strip()
            dealer_list.append(dealer)
        elif len(p) == 5:
            # Row without a province cell; skip the header row.
            if p[0].text.strip() != u'省份':
                dealer[Constant.PROVINCE] = strp
                dealer[Constant.CITY] = p[0].text.strip()
                dealer[Constant.NAME] = p[1].text.strip()
                dealer[Constant.ADDRESSTYPE] = p[2].text.strip()
                dealer[Constant.ADDRESS] = p[3].text.strip()
                dealer[Constant.TELPHONE] = p[4].text.strip()
                dealer_list.append(dealer)
        elif len(p) == 3:
            print('@@')
    print('@@@')
    self.saver.add(dealer_list)
    self.saver.commit()
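Two details of the code above are easier to see on a small sample: why `.text` stops at `<br/>`, and how the province is carried over to rows that lack a province cell. The sketch below is written for Python 3 and uses the standard library's `xml.etree.ElementTree` as a stand-in for pyquery, since it exposes the same `.text`/`.tail` model. The sample table, its simplified 4-column/3-column layout, and the helper names `cell_text` and `parse_rows` are all assumptions for illustration. It also joins the text pieces with `itertext()` instead of replacing `<br/>` in the raw page, which is an alternative to the approach used in the article.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample that mimics the dealer table.
SAMPLE = """
<table>
  <tr><td>Guangdong</td><td>Guangzhou</td><td>Dealer A</td><td>020-1111<br/>020-2222</td></tr>
  <tr><td>Shenzhen</td><td>Dealer B</td><td>0755-3333</td></tr>
</table>
"""


def cell_text(td):
    """Join all text inside a cell, including text that follows <br/> children."""
    return "&".join(part.strip() for part in td.itertext() if part.strip())


def parse_rows(markup):
    """Return one dict per dealer, reusing the province of the last full row."""
    dealers = []
    province = ""  # carried over from the last row that had a province cell
    for tr in ET.fromstring(markup.strip()).iter("tr"):
        cells = [cell_text(td) for td in tr.findall("td")]
        if len(cells) == 4:
            province = cells[0]  # full row: remember the province
            cells = cells[1:]
        elif len(cells) != 3:
            continue  # skip rows that do not match either layout
        city, name, phone = cells
        dealers.append(
            {"province": province, "city": city, "name": name, "phone": phone}
        )
    return dealers


root = ET.fromstring(SAMPLE.strip())
first_phone_td = root.find("tr")[3]
print(repr(first_phone_td.text))  # only the text before <br/>
print(cell_text(first_phone_td))  # both phone numbers, joined with '&'

for dealer in parse_rows(SAMPLE):
    print(dealer)
```

Note that `ElementTree` only accepts well-formed XML, so this is a way to study the behavior on a clean sample, not a replacement for an HTML parser on real pages.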

4. The code ran successfully: the expected data was retrieved and saved to Excel.
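The article does not show the saver object behind `self.saver.add(...)` and `self.saver.commit()`. As a rough illustration of that add-then-commit pattern, here is a minimal sketch for Python 3 that buffers rows and writes them as CSV, a format Excel can open. The `CsvSaver` class, the field names, and the sample row are hypothetical, and CSV is swapped in here for whatever Excel format the original saver produced.

```python
import csv
import io

# Hypothetical column names; the article's Constant class is not shown.
FIELDS = ["province", "city", "name", "address_type", "address", "telephone"]


class CsvSaver:
    """Hypothetical stand-in for the article's saver: buffer rows, write on commit."""

    def __init__(self, stream):
        self.stream = stream
        self.rows = []
        self.header_written = False

    def add(self, dealers):
        """Queue a list of dealer dicts for the next commit."""
        self.rows.extend(dealers)

    def commit(self):
        """Write all queued rows to the stream and clear the queue."""
        writer = csv.DictWriter(self.stream, fieldnames=FIELDS)
        if not self.header_written:
            writer.writeheader()
            self.header_written = True
        writer.writerows(self.rows)
        self.rows = []


buffer = io.StringIO()
saver = CsvSaver(buffer)
saver.add([
    {
        "province": "Guangdong",
        "city": "Guangzhou",
        "name": "Dealer A",
        "address_type": "showroom",
        "address": "1 Example Road",
        "telephone": "020-1111&020-2222",
    }
])
saver.commit()
print(buffer.getvalue())
```

To write a real file, the stream could be opened with `open(path, "w", newline="", encoding="utf-8-sig")`; the byte order mark helps Excel detect UTF-8 when the data contains Chinese text.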

This article was originally written in Chinese and published on aliyun.com.
