Python crawler code for residential-community data

Source: Internet
Author: User
Tags: touch, xpath

A feature needed a set of community location data for a city. My first approach was a keyword search through the Baidu Map API. It was barely usable: the amount of data returned was very small, and a large number of communities were missing from the results.
Over the weekend I found a site that directly lists all the residential communities (xiaoqu) in every city. Ecstatic, I immediately wrote a crawler to try it out.
The code is pasted below: Python 2.7, using the lxml and requests libraries.
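For reference, the Baidu Place keyword search mentioned above is a plain HTTP GET. A minimal sketch of building such a request URL (the endpoint and parameter names follow my recollection of Baidu's Place API v2 docs and should be verified against them; the `ak` key is a placeholder):

```python
# Sketch: build a Baidu Place API v2 keyword-search URL.
# Endpoint and parameter names are assumptions from the public docs;
# "YOUR_AK" is a hypothetical placeholder for a real application key.
from urllib.parse import urlencode

def baidu_place_search_url(keyword, city, ak):
    """Build the GET URL for a keyword search limited to one city."""
    base = "http://api.map.baidu.com/place/v2/search"
    params = {
        "query": keyword,   # search keyword, e.g. "community"
        "region": city,     # city to restrict the search to
        "output": "json",   # response format
        "ak": ak,           # application key (placeholder)
    }
    return base + "?" + urlencode(params)

print(baidu_place_search_url("community", "Qingdao", "YOUR_AK"))
```

Fetching that URL with requests.get() and paging through the JSON results is what turned out to return too little data for this use case.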

# coding: utf-8
# author: zx
# date:   2015/07/27
import requests
import MySQLdb
import time
import string
import random
from lxml import etree

# Pool of UA headers; one is picked at random for each request
headers = [
    {"User-Agent": "Mozilla/5.0 (Linux; U; Android 4.1; en-us; GT-N7100 Build/JRO03C) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30"},
    {"User-Agent": "Mozilla/5.0 (compatible; MSIE 10.0; Windows Phone 8.0; Trident/6.0; IEMobile/10.0; ARM; Touch; NOKIA; Lumia 520)"},
    {"User-Agent": "Mozilla/5.0 (BB10; Touch) AppleWebKit/537.10+ (KHTML, like Gecko) Version/10.0.9.2372 Mobile Safari/537.10+"},
    {"User-Agent": "Mozilla/5.0 (Linux; Android 4.4.2; GT-I9505 Build/JDQ39) AppleWebKit/537.36 (KHTML, like Gecko) Version/1.5 Chrome/28.0.1500.94 Mobile Safari/537.36"},
]

# City entry page -- I only scraped Qingdao.
# For other cities (or all cities), the city list can be crawled from
# http://m.anjuke.com/cityList
url = 'http://m.anjuke.com/qd/xiaoqu/'
req = requests.get(url)
cookie = req.cookies.get_dict()

# Connect to the database
conn = MySQLdb.connect('localhost', '******', '******', '****', charset='utf8')
cursor = conn.cursor()
sql = "insert into xiaoqu (name, lat, lng, address, district) values (%s, %s, %s, %s, %s)"
sql_v = []

page = etree.HTML(req.text)
districtHtml = page.xpath(u"//div[@class='listcont cont_hei']")[0]

# Collect the URL of each administrative district of the target city.
# If you don't need to distinguish districts, you can scrape "All" directly --
# that is the entry URL above.
districtUrl = {}
i = 0
for a in districtHtml:
    if i == 0:  # skip the first link ("All")
        i = 1
        continue
    districtUrl[a.text] = a.get('href')

# Start scraping
total_all = 0
for k, u in districtUrl.items():
    p = 1  # page number
    while True:
        header_i = random.randint(0, len(headers) - 1)
        url_p = u.rstrip('/') + '-p' + str(p)
        r = requests.get(url_p, cookies=cookie, headers=headers[header_i])
        page = etree.HTML(r.text)  # watch the upper/lower case here, it varies
        communitysUrlDiv = page.xpath(u"//div[@class='items']")[0]
        total = len(communitysUrlDiv)
        i = 0
        for a in communitysUrlDiv:
            i += 1
            r = requests.get(a.get('href'), cookies=cookie, headers=headers[header_i])
            # While scraping I found that a few 404 pages crashed the program
            # outright -- the code wasn't robust enough. With this if-check and
            # the try blocks below, errors can be skipped or handled with some
            # simple processing and debugging.
            if r.status_code == 404:
                continue
            page = etree.HTML(r.text)
            try:
                name = page.xpath(u"//h1[@class='f1']")[0].text
            except:
                print a.get('href')
                print r.text
                raw_input()
            # A few communities have no latitude/longitude set;
            # for those only the address is available.
            try:
                latlng = page.xpath(u"//a[@class='comm_map']")[0]
                lat = latlng.get('lat')
                lng = latlng.get('lng')
                address = latlng.get('address')
            except:
                lat = ''
                lng = ''
                address = page.xpath(u"//span[@class='rightArea']/em")[0].text
            sql_v.append((name, lat, lng, address, k))
            print "\r\r\r",
            print u"Downloading %s data, page %d, total %d records, current:".encode('gbk') % (k.encode('gbk'), p, total) + string.rjust(str(i), 3).encode('gbk'),
            time.sleep(0.5)  # pause after each item
        # Insert the buffered rows into the database
        cursor.executemany(sql, sql_v)
        sql_v = []
        time.sleep(5)  # pause after each page
        total_all += total
        print ''
        print u"Inserted %d records, total %d".encode('gbk') % (total, total_all)
        if total < 500:
            break
        else:
            p += 1

# Close the database promptly -- be a good citizen, task complete ~
cursor.close()
conn.close()
print u"All data collected! Total %d records".encode('gbk') % total_all
raw_input()
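The buffered insert in the code above (append rows to sql_v while scraping, then flush them with cursor.executemany) is the standard DB-API 2.0 batch pattern and works the same with other drivers. A minimal self-contained sketch using the stdlib sqlite3 module (table and columns mirror the post; note sqlite3 uses ? placeholders where MySQLdb uses %s):

```python
import sqlite3

# An in-memory database stands in for the MySQL connection in the post.
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute(
    "CREATE TABLE xiaoqu (name TEXT, lat TEXT, lng TEXT, address TEXT, district TEXT)"
)

# Buffer rows while scraping, then flush them in one executemany() call.
sql = "INSERT INTO xiaoqu (name, lat, lng, address, district) VALUES (?, ?, ?, ?, ?)"
sql_v = [
    ("Community A", "36.07", "120.38", "Addr A", "Shinan"),
    ("Community B", "", "", "Addr B", "Shibei"),  # no lat/lng, address only
]
cursor.executemany(sql, sql_v)
conn.commit()
sql_v = []  # clear the buffer after each page, as the post does

print(cursor.execute("SELECT COUNT(*) FROM xiaoqu").fetchone()[0])  # -> 2
```

Batching inserts per page (rather than one INSERT per community) keeps the number of round trips to the database down, which matters when each scraped page yields hundreds of rows.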


I think the comments above are already detailed. Since the output is displayed in cmd, the strings of course have to be re-encoded (to GBK).
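On that encoding point: the Windows cmd console commonly runs code page 936 (GBK), which is why the post encodes its Unicode strings before printing in Python 2. A small illustration of the round trip (Python 3 syntax, where encode() returns bytes):

```python
# Unicode <-> GBK round trip, as needed for a GBK (cp936) console in Python 2.
s = u"青岛"                   # "Qingdao" in Chinese
gbk_bytes = s.encode("gbk")   # the bytes a GBK console can display
assert gbk_bytes.decode("gbk") == s  # lossless round trip

# GBK uses two bytes per Chinese character, versus UTF-8's three:
print(len(gbk_bytes), len(s.encode("utf-8")))  # -> 4 6
```

In Python 3 this manual encoding is usually unnecessary, since print() writes text and the interpreter handles the console code page itself.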
The following is a screenshot of the running state and the resulting data.
