1. Start from the URL of a car website's dealer-listing page.
2. Inspecting the site in Firefox shows that it does not deliver its data as JSON; the information is embedded directly in the HTML page.
3. Use PyQuery (from the pyquery library) to parse the HTML.
The parsing code is as follows:
def get_dealer_info(self):
    """Fetch the dealer information."""
    # CSS path obtained with Firefox's "Copy CSS path" on the target element
    css_select = 'html body div.box div.news_wrapper div.main div.news_list div.service_main div table tr'
    # Read the page
    page = urllib2.urlopen(self.entry_url).read()
    # The phone cells use <br/> line breaks. When a pair of tags contains a
    # <br/>, element.text only returns the text before it, because the parser
    # treats the text after the /> as belonging to the <br/> element; so the
    # line breaks are replaced in the raw markup before parsing.
    page = page.replace('<br/>', '&')
    page = page.replace('<br>', '&')
    # Parse the page with PyQuery (from pyquery import PyQuery as pq)
    d = pq(page)
    # List used to hand the results to the storage method
    dealer_list = []
    for dealer_tr in d(css_select):
        # Each tr is located here; the actual data sits in its td tags
        p = dealer_tr.findall('td')
        # Dict holding one dealer's fields, appended to the list below
        dealer = {}
        if len(p) == 1:
            # Rows that do not match the expected format are skipped;
            # this filtering depends on the data actually required
            pass
        elif len(p) == 6:
            # Six-column rows carry the province; remember it for the
            # five-column rows that follow
            strp = p[0].text.strip()
            dealer[constant.province] = p[0].text.strip()
            dealer[constant.city] = p[1].text.strip()
            dealer[constant.name] = p[2].text.strip()
            dealer[constant.addresstype] = p[3].text.strip()
            dealer[constant.address] = p[4].text.strip()
            dealer[constant.telphone] = p[5].text.strip()
            dealer_list.append(dealer)
        elif len(p) == 5:
            # Five-column rows omit the province column; skip the header row
            if p[0].text.strip() != u'Province':
                dealer[constant.province] = strp
                dealer[constant.city] = p[0].text.strip()
                dealer[constant.name] = p[1].text.strip()
                dealer[constant.addresstype] = p[2].text.strip()
                dealer[constant.address] = p[3].text.strip()
                dealer[constant.telphone] = p[4].text.strip()
                dealer_list.append(dealer)
        # len(p) == 3 rows carry no dealer data and are ignored
    self.saver.add(dealer_list)
    self.saver.commit()
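The comments above describe why a `<br/>` inside a cell breaks text extraction: in the ElementTree text/tail model (which lxml, PyQuery's backend, also follows), `element.text` only holds the text before the first child element; everything after the `<br/>` lives in that child's `.tail`. A minimal sketch with the stdlib `xml.etree.ElementTree` (using `' / '` as the separator here, since a bare `&` is not valid XML):

```python
import xml.etree.ElementTree as ET

# A table cell whose phone numbers are separated by <br/>:
cell = ET.fromstring('<td>010-1234<br/>010-5678</td>')
print(cell.text)             # '010-1234' -- only the text before <br/>
print(cell.find('br').tail)  # '010-5678' -- the rest hangs off the <br/> element

# Replacing <br/> in the raw markup before parsing keeps everything in .text:
fixed = ET.fromstring('<td>010-1234<br/>010-5678</td>'.replace('<br/>', ' / '))
print(fixed.text)            # '010-1234 / 010-5678'
```

This is why the method above rewrites the raw page string before handing it to PyQuery.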
4. The final code executed successfully; the data was retrieved and stored in Excel.
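The article does not show the `saver` class it commits to. As a hypothetical stand-in, a minimal sketch that writes the scraped dealer dicts to a CSV file (which Excel opens directly; the original presumably wrote an .xls via a spreadsheet library). The `save_dealers` helper and the field names are assumptions, not the article's API:

```python
import csv

def save_dealers(dealer_list, path):
    """Write a list of dealer dicts to a CSV file that Excel can open."""
    # Assumed field names mirroring the constant.* keys used in the scraper
    fields = ['province', 'city', 'name', 'addresstype', 'address', 'telphone']
    # utf-8-sig adds a BOM so Excel detects the encoding correctly
    with open(path, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(dealer_list)

save_dealers([{'province': 'Beijing', 'city': 'Beijing', 'name': 'Dealer A',
               'addresstype': '4S', 'address': 'No. 1 Road', 'telphone': '010-1234'}],
             'dealers.csv')
```

Collecting all rows into a list and writing once at the end, as the scraper above does, keeps file I/O out of the parsing loop.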