Python example: crawling a car website, parsing the HTML, and saving the data to Excel

Source: Internet
Author: User
1. Get the address (entry URL) of the car website's dealer listing page.

2. Inspecting the site with Firefox shows that the information is not delivered as JSON; the pages are plain HTML, as the quick check below confirms.
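One quick way to verify this is to look at the response's Content-Type header. A minimal sketch, using urllib2 to match the article's code; the entry URL below is a placeholder, not the real site:

import urllib2

# Placeholder URL; substitute the real dealer-list page.
entry_url = 'http://www.example-car-site.com/dealer/list/'

response = urllib2.urlopen(entry_url)
# A JSON API would usually report application/json; this page reports text/html.
print(response.info().gettype())
# The body itself is an ordinary HTML document.
print(response.read()[:200])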

3. Use PyQuery (from the pyquery library) to parse the HTML; a minimal example of the calls involved follows.
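A short, self-contained sketch of the PyQuery usage the crawler relies on: selecting table rows with a CSS selector and reading the text of each cell. The HTML fragment is made up for illustration.

from pyquery import PyQuery as pq

# Made-up fragment shaped like the dealer table on the real page.
html = '''
<table>
  <tr><td>Province</td><td>City</td><td>Dealer name</td></tr>
  <tr><td>Guangdong</td><td>Shenzhen</td><td>Example Motors</td></tr>
</table>
'''

d = pq(html)
for row in d('tr'):                    # each match is an lxml element for one <tr>
    cells = row.findall('td')          # the <td> elements inside that row
    print([cell.text.strip() for cell in cells])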

Page layout (screenshot omitted):

The code is as follows:


def get_dealer_info(self):
    """Get dealer information."""
    css_select = 'html body div.box div.news_wrapper div.main div.news_list div.service_main div table tr'
    # CSS path obtained with Firefox's "Copy CSS Path" on the element we need
    page = urllib2.urlopen(self.entry_url).read()
    # Read the page
    page = page.replace('<br/>', ' & ')
    page = page.replace('<br>', ' & ')
    # The phone numbers on the page use <br> line breaks, which causes a problem
    # when crawling: if the text inside a pair of tags contains a <br>, only the
    # text before the <br> is returned and the rest cannot be found, presumably
    # because the HTML parser treats the /> as the end of the element's text.
    d = pq(page)
    # Parse the page with PyQuery; pq is available here because of
    # "from pyquery import PyQuery as pq"
    dealer_list = []
    # List of dealers that will be handed to the storage method
    for dealer_div in d(css_select):
        # Each match is a <tr>; the actual data sits inside its <td> tags
        p = dealer_div.findall('td')
        # p is the collection of all <td> elements inside one <tr>
        dealer = {}
        # Dictionary holding one dealer's information before it is appended to the list
        if len(p) == 1:
            # The if/elif branches below clean the data: some rows do not match
            # the required format and must be skipped; adapt this block as needed
            print '@'
        elif len(p) == 6:
            strp = p[0].text.strip()
            dealer[constant.city] = p[1].text.strip()
            strc = p[2].text.strip()

            dealer[constant.province] = p[0].text.strip()
            dealer[constant.city] = p[1].text.strip()
            dealer[constant.name] = p[2].text.strip()
            dealer[constant.addresstype] = p[3].text.strip()
            dealer[constant.address] = p[4].text.strip()
            dealer[constant.telphone] = p[5].text.strip()
            dealer_list.append(dealer)
        elif len(p) == 5:
            # Rows with five cells omit the province column; reuse the province
            # remembered from the last six-cell row (strp) and skip the header row
            if p[0].text.strip() != u'province':
                dealer[constant.province] = strp
                dealer[constant.city] = p[0].text.strip()
                dealer[constant.name] = p[1].text.strip()
                dealer[constant.addresstype] = p[2].text.strip()
                dealer[constant.address] = p[3].text.strip()
                dealer[constant.telphone] = p[4].text.strip()
                dealer_list.append(dealer)
        elif len(p) == 3:
            print '@@'
    print '@@@'
    self.saver.add(dealer_list)
    self.saver.commit()
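The <br> workaround near the top of the function can be shown in isolation. With lxml, which PyQuery uses underneath, an element's .text attribute only holds the text that comes before the first child element, so the part of a phone number after a <br/> is lost unless the tag is replaced first. A small sketch with made-up data:

from pyquery import PyQuery as pq

html = '<table><tr><td>010-1234<br/>010-5678</td></tr></table>'

cell = pq(html)('td')[0]
print(cell.text)          # only '010-1234'; the text after <br/> is not in .text

fixed = pq(html.replace('<br/>', ' & '))('td')[0]
print(fixed.text)         # '010-1234 & 010-5678'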

4. The code runs successfully; the dealer data is retrieved and saved to Excel.
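The saver object is not shown in the article. As a hedged sketch only, a minimal Excel writer could be built on the xlwt package (an assumption; the article does not name the library), with the dictionary keys assumed to be the plain strings 'province', 'city', and so on, standing in for the constant.* values used above:

import xlwt

class ExcelSaver(object):
    """Sketch of a saver that collects dealer dicts and writes an .xls file."""

    COLUMNS = ['province', 'city', 'name', 'addresstype', 'address', 'telphone']

    def __init__(self, filename):
        self.filename = filename
        self.rows = []

    def add(self, dealer_list):
        # Accept the list produced by get_dealer_info()
        self.rows.extend(dealer_list)

    def commit(self):
        book = xlwt.Workbook(encoding='utf-8')
        sheet = book.add_sheet('dealers')
        for col, key in enumerate(self.COLUMNS):
            sheet.write(0, col, key)                     # header row
        for row, dealer in enumerate(self.rows, start=1):
            # One dealer per row, one field per column
            for col, key in enumerate(self.COLUMNS):
                sheet.write(row, col, dealer.get(key, ''))
        book.save(self.filename)

# Usage sketch: saver = ExcelSaver('dealers.xls'); get_dealer_info() then calls
# saver.add(dealer_list) and saver.commit().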
