Python example: crawling a car website, parsing the HTML, and storing the data in Excel

Source: Internet
Author: User

1. Start with the address of a car website.

2. Inspecting the site in Firefox shows that it does not serve its information as JSON data; the information is simply embedded in the HTML page.

3. Parse the HTML with the PyQuery class from the pyquery library.
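One parsing detail is worth seeing in isolation, because the listing further down works around it: when a table cell contains a `<br/>`, the element's `.text` holds only the text before the break, and the rest is stored as the `tail` of the `br` element. PyQuery builds on lxml, whose elements follow the same text/tail model as the standard library's ElementTree, so the sketch below uses `xml.etree.ElementTree` as a stand-in. It is written for Python 3 (the article's listing is Python 2), and the sample phone numbers and the `flatten_breaks` helper are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical table cell with two phone numbers separated by a line break.
cell_html = "<td>010-1111<br/>010-2222</td>"
td = ET.fromstring(cell_html)

# .text stops at the first child element, so the second number is missing.
first_only = td.text

# The text after <br/> is stored on the br element itself, as its tail.
after_br = td[0].tail

# itertext() walks text and tails in document order, so nothing is lost.
all_text = " & ".join(part.strip() for part in td.itertext())

# Alternative used by the article: replace the breaks in the raw markup
# before parsing. This pattern covers <br>, <br/> and <br />, any case.
BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)


def flatten_breaks(html, separator=" & "):
    """Return html with every br tag replaced by separator."""
    return BR_TAG.sub(separator, html)


flattened = flatten_breaks("<td>a<br />b<BR>c</td>")
```

Replacing the tags up front keeps the later `.text` calls simple, at the cost of editing the markup as a plain string; reading `itertext()` avoids that but changes how each cell is read.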

Page style:

The code is as follows:

import urllib2
from pyquery import PyQuery as pq
# `constant` (dictionary key names), `self.entry_url`, and `self.saver` belong to
# the author's project and are not shown in this article.

def get_dealer_info(self):
    """Fetch dealer information."""
    # CSS path to the table rows, copied with Firefox's "copy CSS path" feature.
    css_select = 'html body div.box div.news_wrapper div.main div.news_list div.service_main div table tr'
    # Read the page.
    page = urllib2.urlopen(self.entry_url).read()
    # The phone cells use <br/> line breaks, which causes a problem when crawling:
    # an element's .text only holds the text before the first <br/>, so the
    # numbers after it would be lost. Replace the breaks with '&' before parsing.
    # The source listing repeats the same call twice; covering both spellings
    # of the tag is an assumption about what the two calls were meant to do.
    page = page.replace('<br />', '&')
    page = page.replace('<br/>', '&')
    # Parse the page with PyQuery (pq is PyQuery, see the import above).
    d = pq(page)
    # List of dealers to hand to the storage method.
    dealer_list = []
    # Province of the most recent 6-cell row, reused by the 5-cell rows after it.
    # Initialized here so a leading 5-cell row cannot raise a NameError.
    strp = u''
    # Each match is a tr; the data is in the td tags inside it.
    for dealer_div in d(css_select):
        # p is the collection of all td elements in this tr.
        p = dealer_div.findall('td')
        # Holds one dealer's information before it is added to the list.
        dealer = {}
        # These branches filter the rows: some layouts do not fit the final
        # data format and are skipped. Adapt this block to your requirements.
        if len(p) == 1:
            print '@'
        elif len(p) == 6:
            # A full row: remember the province for the rows that follow.
            strp = p[0].text.strip()
            dealer[constant.province] = strp
            dealer[constant.city] = p[1].text.strip()
            dealer[constant.name] = p[2].text.strip()
            dealer[constant.addresstype] = p[3].text.strip()
            dealer[constant.address] = p[4].text.strip()
            dealer[constant.telphone] = p[5].text.strip()
            dealer_list.append(dealer)
        elif len(p) == 5:
            # A row without a province cell; skip the table header. The literal
            # appears in translated form here and must match the text on the page.
            if p[0].text.strip() != u'Province':
                dealer[constant.province] = strp
                dealer[constant.city] = p[0].text.strip()
                dealer[constant.name] = p[1].text.strip()
                dealer[constant.addresstype] = p[2].text.strip()
                dealer[constant.address] = p[3].text.strip()
                dealer[constant.telphone] = p[4].text.strip()
                dealer_list.append(dealer)
        elif len(p) == 3:
            print '@@'
    print '@@@'
    # Hand the collected dealers to the saver and write them out.
    self.saver.add(dealer_list)
    self.saver.commit()
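The row-handling branches in the listing above can be pulled out into a pure function that is easy to test without fetching a page. The following is a minimal sketch under stated assumptions, written for Python 3: `normalize_rows`, the field names, and the sample rows are hypothetical, and the rules simply mirror the listing (6-cell rows carry a province, 5-cell rows inherit the previous one, a 5-cell row starting with the header label is skipped, all other rows are dropped).

```python
FIELDS = ("province", "city", "name", "address_type", "address", "telephone")


def normalize_rows(rows, header_label="Province"):
    """Turn raw table rows (lists of cell strings) into dealer dicts."""
    records = []
    current_province = ""  # province of the most recent 6-cell row
    for cells in rows:
        cells = [cell.strip() for cell in cells]
        if len(cells) == 6:
            # Full row: it names the province for the rows that follow.
            current_province = cells[0]
            values = cells
        elif len(cells) == 5 and cells[0] != header_label:
            # Province cell is missing: reuse the remembered one.
            values = [current_province] + cells
        else:
            # Header rows and rows with any other layout are skipped.
            continue
        records.append(dict(zip(FIELDS, values)))
    return records


# Invented sample data, for illustration only.
sample_rows = [
    ["Province", "Name", "Type", "Address", "Phone"],
    ["Guangdong", "Guangzhou", "Dealer A", "4S", "1 Example Rd", "020-0000"],
    [" Shenzhen ", "Dealer B", "4S", "2 Example Rd", "0755-0000"],
    ["footnote"],
]
records = normalize_rows(sample_rows)
```

Keeping this logic separate from the download and parsing steps means the filtering rules can be adjusted and checked against a few hand-written rows.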

4. The code ran successfully, retrieved the corresponding data, and stored it in Excel.
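The article does not show the saver class that writes the spreadsheet, so its real implementation is unknown. As a stand-in, the sketch below writes the dealer records to a CSV file that Excel can open, using only the Python 3 standard library; it does not produce a native Excel workbook. The function names, field names, and sample record are all hypothetical.

```python
import csv
import io

FIELDS = ["province", "city", "name", "address_type", "address", "telephone"]


def write_dealers_csv(dealers, stream):
    """Write dealer dicts to an open text stream as CSV with a header row."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(dealers)


def save_dealers(dealers, path):
    """Write dealer dicts to a CSV file at path."""
    # utf-8-sig adds a byte order mark, which helps Excel detect UTF-8;
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8-sig") as handle:
        write_dealers_csv(dealers, handle)


# Invented sample record, written to memory instead of disk.
buffer = io.StringIO()
write_dealers_csv(
    [{
        "province": "Guangdong",
        "city": "Shenzhen",
        "name": "Dealer B",
        "address_type": "4S",
        "address": "2 Example Rd",
        "telephone": "0755-0000 & 0755-1111",
    }],
    buffer,
)
csv_lines = buffer.getvalue().splitlines()
```

Writing to a stream first and wrapping it in `save_dealers` keeps the formatting testable in memory; a third-party spreadsheet library would be needed for a true `.xls` or `.xlsx` file.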
