Tutorial: using a Python program to crawl all of Sina's IP data for China

Source: Internet
Author: User
Data analysis, and website analytics in particular, often requires analyzing visitors' IP addresses, mainly to break visitors down by province, city, and district. The current pure IP database does not distinguish these levels well, so I looked for another feasible solution (other than paying for one, of course). The solution is to crawl Sina's IP data.

The IP data interface of Sina is:

http://int.dpool.sina.com.cn/iplookup/iplookup.php?format=json&ip=123.124.2.85

The returned data is:

{"ret": 1, "start": "123.123.221.0", "end": "123.124.158.29", "country": "\u4e2d\u56fd", "province": "\u5317\u4eac", "city": "\u5317\u4eac", "district": "", "isp": "\u8054\u901a", "type": "", "desc": ""}

The returned content already contains the province, city, and district information, which is exactly what we want.
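As a small illustration of the request and response formats above, the following sketch builds the lookup URL and decodes the sample response. It is written for Python 3 with the standard library only. The helper name build_url is made up for this example, the lowercase field names are an assumption, and the response string is the sample from the text rather than a live reply.

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://int.dpool.sina.com.cn/iplookup/iplookup.php"

def build_url(ip):
    """Return the lookup URL for one IPv4 address (hypothetical helper)."""
    return BASE_URL + "?" + urlencode({"format": "json", "ip": ip})

# Sample response copied from the text, not fetched live.
# Lowercase keys are an assumption about the real field names.
SAMPLE = (r'{"ret": 1, "start": "123.123.221.0", "end": "123.124.158.29", '
          r'"country": "\u4e2d\u56fd", "province": "\u5317\u4eac", '
          r'"city": "\u5317\u4eac", "district": "", "isp": "\u8054\u901a", '
          r'"type": "", "desc": ""}')

info = json.loads(SAMPLE)  # the \uXXXX escapes are decoded to Chinese text

print(build_url("123.124.2.85"))
print(info["country"], info["province"], info["city"], info["isp"])
```

The fields that matter for the crawl are start and end (the segment the address belongs to) and the province, city, and district strings.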

Next, how to crawl this data. The main work is enumeration: the IP in the interface URL is replaced over and over. Trying every IP address is certainly not feasible, so we narrow the scope and enumerate only China's IP segments. Because Sina's interface returns the whole IP segment that an address belongs to, the part that has to be enumerated shrinks further. On top of that, the 256 addresses that share the same first three octets are basically in one area, so only one address per block needs to be queried, which cuts the data down again. The most important step in the enumeration is converting the IP address to an int.
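The two reductions described above can be sketched as follows. An address is collapsed to an integer built from its first three octets, so all 256 addresses of a block share one key, and the range returned by a single lookup shows how many blocks need no further request. The function names prefix_to_int and int_to_prefix are made up for this sketch; the range is the one from the sample response.

```python
def prefix_to_int(ip):
    """Map 'a.b.c' or 'a.b.c.d' to an integer built from the first three octets."""
    a, b, c = [int(part) for part in ip.split(".")[:3]]
    return (a << 16) | (b << 8) | c

def int_to_prefix(n):
    """Inverse of prefix_to_int: integer back to 'a.b.c'."""
    return "%d.%d.%d" % ((n >> 16) & 0xFF, (n >> 8) & 0xFF, n & 0xFF)

# Two addresses in the same 256-address block collapse to the same key.
assert prefix_to_int("123.124.2.85") == prefix_to_int("123.124.2.1")

# Range taken from the sample response: one lookup covers every block in it.
start = prefix_to_int("123.123.221.0")
end = prefix_to_int("123.124.158.29")
blocks_covered = end - start + 1

print(int_to_prefix(start), int_to_prefix(end), blocks_covered)
```

For the sample range this gives 194 blocks, so a single request stands in for 194 of the enumerated keys.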

The IP address segments allocated to China can be found on the APNIC website or in the following file:

http://ftp.apnic.net/apnic/dbase/data/country-ipv4.lst

Let's look at how the enumeration program is written:


import re

def ipv3_to_int(s):
    # "a.b.c" -> integer built from the first three octets
    l = [int(i) for i in s.split('.')]
    return (l[0] << 16) | (l[1] << 8) | l[2]

def int_to_ipv3(s):
    # inverse of ipv3_to_int
    ip1 = s >> 16 & 0xFF
    ip2 = s >> 8 & 0xFF
    ip3 = s & 0xFF
    return "%d.%d.%d" % (ip1, ip2, ip3)

# capture the first three octets of every address on a line
pattern = re.compile(r'(\d{1,3}\.\d{1,3}\.\d{1,3})\.\d{1,3}')

i = open('ChinaIPAddress.csv', 'r')
o = open('ChinaIPAddress.txt', 'a')
for iplist in i.readlines():
    ips = pattern.findall(iplist)
    if len(ips) < 2:
        continue  # skip lines without a start and an end address
    x = ips[0]
    y = ips[1]
    # + 1 so that the last block of the segment is written as well
    for ip in range(ipv3_to_int(x), ipv3_to_int(y) + 1):
        ipadress = str(ip)
        #ip_address = int_to_ipv3(ip)
        o.write(ipadress + '\n')
o.close()
i.close()
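To make the regular-expression step concrete, this sketch applies the same pattern to a single line and lists the integer keys it expands to. The sample line and its "start,end" layout are assumptions made for illustration, since the text does not show the contents of the CSV file. expand_line is a hypothetical helper, and it includes the final block of the segment, which is a choice made in this sketch.

```python
import re

# Same pattern as the script: capture the first three octets of each address.
PATTERN = re.compile(r"(\d{1,3}\.\d{1,3}\.\d{1,3})\.\d{1,3}")

def prefix_to_int(prefix):
    a, b, c = [int(part) for part in prefix.split(".")]
    return (a << 16) | (b << 8) | c

def expand_line(line):
    """Return the integer keys between the two addresses on a line, inclusive."""
    found = PATTERN.findall(line)
    if len(found) < 2:
        return []  # header or malformed line: nothing to enumerate
    first, last = prefix_to_int(found[0]), prefix_to_int(found[1])
    return list(range(first, last + 1))

# Hypothetical CSV line: start address, end address.
print(expand_line("1.0.1.0,1.0.3.255"))
```

Each key printed here corresponds to one line of ChinaIPAddress.txt in the script's output format.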

Once the enumeration above has finished, the Sina IP interface can be crawled. The crawl code is as follows:

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Python 2 script (urllib2, simplejson, print statement)
import urllib, urllib2, simplejson, sqlite3, time

def ipv3_to_int(s):
    # "a.b.c" or "a.b.c.d" -> integer built from the first three octets
    l = [int(i) for i in s.split('.')]
    return (l[0] << 16) | (l[1] << 8) | l[2]

def int_to_ipv4(s):
    # integer key -> "a.b.c.0", the first address of that block
    ip1 = s >> 16 & 0xFF
    ip2 = s >> 8 & 0xFF
    ip3 = s & 0xFF
    return "%d.%d.%d.0" % (ip1, ip2, ip3)

def fetch(ipv4, **kwargs):
    kwargs.update({'ip': ipv4, 'format': 'json'})
    data_base = "http://int.dpool.sina.com.cn/iplookup/iplookup.php"
    url = data_base + '?' + urllib.urlencode(kwargs)
    print url
    fails = 0
    # Retry until the request succeeds. A loop keeps the failure count,
    # which a recursive retry would reset on every call.
    while True:
        try:
            return simplejson.load(urllib2.urlopen(url, timeout=20))
        except (urllib2.URLError, IOError):
            fails += 1
            if fails >= 10:
                # after 10 failures in a row, wait 10 minutes
                sleep_download_time = 60 * 10
                time.sleep(sleep_download_time)
                fails = 0

def dbcreate():
    c = conn.cursor()
    c.execute('create table ipdata (ip integer primary key, ret integer, start text, end text, country text, province text, city text, district text, isp text, type text, desc text)')
    conn.commit()
    c.close()

def dbinsert(ip, address):
    c = conn.cursor()
    c.execute('insert into ipdata values (?,?,?,?,?,?,?,?,?,?,?)',
              (ip, address['ret'], address['start'], address['end'],
               address['country'], address['province'], address['city'],
               address['district'], address['isp'], address['type'],
               address['desc']))
    conn.commit()
    c.close()

conn = sqlite3.connect('ipaddress.sqlite3.db')
dbcreate()

i = open('ChinaIPAddress.txt', 'r')
list = [s.strip() for s in i.readlines()]
end = 0
for ip in list:
    ip = int(ip)
    # only query keys beyond the end of the last segment returned
    if ip > end:
        ipaddress = int_to_ipv4(ip)
        info = fetch(ipaddress)
        # ret == 1 marks a successful lookup, as in the sample response
        if info['ret'] == 1:
            dbinsert(ip, info)
            end = ipv3_to_int(info['end'])
            print ip, end
i.close()
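The skip-ahead idea in the main loop can be tried without touching the network. In this Python 3 sketch, fake_fetch is a stand-in for the real lookup and answers from a small table of made-up ranges; the -1 value it returns for a miss is likewise an assumption. Only the loop logic mirrors the crawler: a key is looked up only if it lies beyond the end of the last range returned.

```python
def prefix_to_int(ip):
    a, b, c = [int(part) for part in ip.split(".")[:3]]
    return (a << 16) | (b << 8) | c

# Made-up ranges standing in for the service's answers (illustration only).
FAKE_RANGES = [("10.0.0.0", "10.0.3.255"), ("10.0.4.0", "10.0.9.255")]

def fake_fetch(key):
    """Hypothetical replacement for fetch(): find the range containing key."""
    for start, end in FAKE_RANGES:
        if prefix_to_int(start) <= key <= prefix_to_int(end):
            return {"ret": 1, "start": start, "end": end}
    return {"ret": -1}  # assumed failure marker

def crawl(keys, lookup):
    """Return (rows, request_count), skipping keys already covered."""
    rows, requests, end = [], 0, -1
    for key in keys:
        if key <= end:
            continue  # covered by the previous answer
        requests += 1
        info = lookup(key)
        if info.get("ret") == 1:
            rows.append((key, info["start"], info["end"]))
            end = prefix_to_int(info["end"])
    return rows, requests

first = prefix_to_int("10.0.0.0")
rows, requests = crawl(range(first, first + 10), fake_fetch)
print(len(rows), requests)
```

With these made-up ranges, ten enumerated keys need only two lookups, one per range.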

This captures all of Sina's IP data for China, which is very useful later in data analysis projects.
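As one possible way to use the result in an analysis, the sketch below looks up the stored range that covers a visitor's address. It assumes a table shaped like the crawler's ipdata table, where the ip key is the integer built from the first three octets of the first block of each range. To stay self-contained it runs against an in-memory database seeded with the sample record, and locate is a hypothetical helper.

```python
import sqlite3

def prefix_to_int(ip):
    a, b, c = [int(part) for part in ip.split(".")[:3]]
    return (a << 16) | (b << 8) | c

conn = sqlite3.connect(":memory:")
# Reduced version of the crawler's table; "end" is quoted because it is an SQL keyword.
conn.execute('CREATE TABLE ipdata (ip INTEGER PRIMARY KEY, start TEXT, '
             '"end" TEXT, province TEXT, city TEXT, isp TEXT)')
# Values taken from the sample response.
conn.execute("INSERT INTO ipdata VALUES (?, ?, ?, ?, ?, ?)",
             (prefix_to_int("123.123.221.0"), "123.123.221.0", "123.124.158.29",
              "\u5317\u4eac", "\u5317\u4eac", "\u8054\u901a"))

def locate(conn, ip):
    """Return (province, city, isp) for ip, or None if no stored range covers it."""
    key = prefix_to_int(ip)
    # Closest range starting at or before the key, compared per 256-address block.
    row = conn.execute('SELECT "end", province, city, isp FROM ipdata '
                       "WHERE ip <= ? ORDER BY ip DESC LIMIT 1", (key,)).fetchone()
    if row is None or key > prefix_to_int(row[0]):
        return None
    return row[1:]

print(locate(conn, "123.124.2.85"))
print(locate(conn, "8.8.8.8"))
```

Because the keys are integers, grouping visitors by province then reduces to one indexed query per address.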
