Python crawler code for residential-community data

Source: Internet
Author: User
Tags: touch, xpath

A feature needed a set of community location data for a city. My first approach was a keyword search through the Baidu Map API. It was barely usable: the amount of data returned was very small, and a large number of communities were missing from the results.
Over the weekend I found a site that directly lists all the residential communities (xiaoqu) in every city. Ecstatic, I immediately wrote a crawler to try it out.
The code is pasted below: Python 2.7, using the lxml and requests libraries.
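For reference, the Baidu Place keyword search mentioned above is a plain HTTP GET. A minimal sketch of building such a request URL (the endpoint and parameter names follow my recollection of Baidu's Place API v2 docs and should be verified against them; the `ak` key is a placeholder):

```python
# Sketch: build a Baidu Place API v2 keyword-search URL.
# Endpoint and parameter names are assumptions from the public docs;
# "YOUR_AK" is a hypothetical placeholder for a real application key.
from urllib.parse import urlencode

def baidu_place_search_url(keyword, city, ak):
    """Build the GET URL for a keyword search limited to one city."""
    base = "http://api.map.baidu.com/place/v2/search"
    params = {
        "query": keyword,   # search keyword, e.g. "community"
        "region": city,     # city to restrict the search to
        "output": "json",   # response format
        "ak": ak,           # application key (placeholder)
    }
    return base + "?" + urlencode(params)

print(baidu_place_search_url("community", "Qingdao", "YOUR_AK"))
```

Fetching that URL with requests.get() and paging through the JSON results is what turned out to return too little data for this use case.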

# coding: utf-8
# author: zx
# date:   2015/07/27
import requests
import MySQLdb
import time
import string
import random
from lxml import etree

# Pool of UA headers; one is picked at random for each request
headers = [
    {"User-Agent": "Mozilla/5.0 (Linux; U; Android 4.1; en-us; GT-N7100 Build/JRO03C) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30"},
    {"User-Agent": "Mozilla/5.0 (compatible; MSIE 10.0; Windows Phone 8.0; Trident/6.0; IEMobile/10.0; ARM; Touch; NOKIA; Lumia 520)"},
    {"User-Agent": "Mozilla/5.0 (BB10; Touch) AppleWebKit/537.10+ (KHTML, like Gecko) Version/10.0.9.2372 Mobile Safari/537.10+"},
    {"User-Agent": "Mozilla/5.0 (Linux; Android 4.4.2; GT-I9505 Build/JDQ39) AppleWebKit/537.36 (KHTML, like Gecko) Version/1.5 Chrome/28.0.1500.94 Mobile Safari/537.36"},
]

# City entry page -- I only scraped Qingdao.
# For other cities (or all cities), the city list can be crawled from
# http://m.anjuke.com/cityList
url = 'http://m.anjuke.com/qd/xiaoqu/'
req = requests.get(url)
cookie = req.cookies.get_dict()

# Connect to the database
conn = MySQLdb.connect('localhost', '******', '******', '****', charset='utf8')
cursor = conn.cursor()
sql = "insert into xiaoqu (name, lat, lng, address, district) values (%s, %s, %s, %s, %s)"
sql_v = []

page = etree.HTML(req.text)
districtHtml = page.xpath(u"//div[@class='listcont cont_hei']")[0]

# Collect the URL of each administrative district of the target city.
# If you don't need to distinguish districts, you can scrape "All" directly --
# that is the entry URL above.
districtUrl = {}
i = 0
for a in districtHtml:
    if i == 0:  # skip the first link ("All")
        i = 1
        continue
    districtUrl[a.text] = a.get('href')

# Start scraping
total_all = 0
for k, u in districtUrl.items():
    p = 1  # page number
    while True:
        header_i = random.randint(0, len(headers) - 1)
        url_p = u.rstrip('/') + '-p' + str(p)
        r = requests.get(url_p, cookies=cookie, headers=headers[header_i])
        page = etree.HTML(r.text)  # watch the upper/lower case here, it varies
        communitysUrlDiv = page.xpath(u"//div[@class='items']")[0]
        total = len(communitysUrlDiv)
        i = 0
        for a in communitysUrlDiv:
            i += 1
            r = requests.get(a.get('href'), cookies=cookie, headers=headers[header_i])
            # While scraping I found that a few 404 pages crashed the program
            # outright -- the code wasn't robust enough. With this if-check and
            # the try blocks below, errors can be skipped or handled with some
            # simple processing and debugging.
            if r.status_code == 404:
                continue
            page = etree.HTML(r.text)
            try:
                name = page.xpath(u"//h1[@class='f1']")[0].text
            except:
                print a.get('href')
                print r.text
                raw_input()
            # A few communities have no latitude/longitude set;
            # for those only the address is available.
            try:
                latlng = page.xpath(u"//a[@class='comm_map']")[0]
                lat = latlng.get('lat')
                lng = latlng.get('lng')
                address = latlng.get('address')
            except:
                lat = ''
                lng = ''
                address = page.xpath(u"//span[@class='rightArea']/em")[0].text
            sql_v.append((name, lat, lng, address, k))
            print "\r\r\r",
            print u"Downloading %s data, page %d, total %d records, current:".encode('gbk') % (k.encode('gbk'), p, total) + string.rjust(str(i), 3).encode('gbk'),
            time.sleep(0.5)  # pause after each item
        # Insert the buffered rows into the database
        cursor.executemany(sql, sql_v)
        sql_v = []
        time.sleep(5)  # pause after each page
        total_all += total
        print ''
        print u"Inserted %d records, total %d".encode('gbk') % (total, total_all)
        if total < 500:
            break
        else:
            p += 1

# Close the database promptly -- be a good citizen, task complete ~
cursor.close()
conn.close()
print u"All data collected! Total %d records".encode('gbk') % total_all
raw_input()
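The buffered insert in the code above (append rows to sql_v while scraping, then flush them with cursor.executemany) is the standard DB-API 2.0 batch pattern and works the same with other drivers. A minimal self-contained sketch using the stdlib sqlite3 module (table and columns mirror the post; note sqlite3 uses ? placeholders where MySQLdb uses %s):

```python
import sqlite3

# An in-memory database stands in for the MySQL connection in the post.
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute(
    "CREATE TABLE xiaoqu (name TEXT, lat TEXT, lng TEXT, address TEXT, district TEXT)"
)

# Buffer rows while scraping, then flush them in one executemany() call.
sql = "INSERT INTO xiaoqu (name, lat, lng, address, district) VALUES (?, ?, ?, ?, ?)"
sql_v = [
    ("Community A", "36.07", "120.38", "Addr A", "Shinan"),
    ("Community B", "", "", "Addr B", "Shibei"),  # no lat/lng, address only
]
cursor.executemany(sql, sql_v)
conn.commit()
sql_v = []  # clear the buffer after each page, as the post does

print(cursor.execute("SELECT COUNT(*) FROM xiaoqu").fetchone()[0])  # -> 2
```

Batching inserts per page (rather than one INSERT per community) keeps the number of round trips to the database down, which matters when each scraped page yields hundreds of rows.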


I think the comments above are already detailed. Since the output is displayed in cmd, the strings of course have to be re-encoded (to GBK).
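On that encoding point: the Windows cmd console commonly runs code page 936 (GBK), which is why the post encodes its Unicode strings before printing in Python 2. A small illustration of the round trip (Python 3 syntax, where encode() returns bytes):

```python
# Unicode <-> GBK round trip, as needed for a GBK (cp936) console in Python 2.
s = u"青岛"                   # "Qingdao" in Chinese
gbk_bytes = s.encode("gbk")   # the bytes a GBK console can display
assert gbk_bytes.decode("gbk") == s  # lossless round trip

# GBK uses two bytes per Chinese character, versus UTF-8's three:
print(len(gbk_bytes), len(s.encode("utf-8")))  # -> 4 6
```

In Python 3 this manual encoding is usually unnecessary, since print() writes text and the interpreter handles the console code page itself.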
The following is a screenshot of the running state and the resulting data.
