Data analysis, and website analytics in particular, often requires analyzing visitors' IP addresses, mainly to work out each visitor's province, city, and district. The pure IP database currently available does not distinguish these levels well, so I looked for another workable solution (one that does not involve paying for a database, of course). The solution is to crawl Sina's IP data.
The IP data interface of Sina is:
http://int.dpool.sina.com.cn/iplookup/iplookup.php?format=json&ip=123.124.2.85
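As a small illustration, the query URL above can be assembled with the standard library. This is only a sketch in Python 3: the helper name build_lookup_url is made up for this example, and it assumes the interface still accepts the two parameters shown in the URL.

```python
from urllib.parse import urlencode

# Base URL taken from the article; the helper name below is hypothetical.
DATA_BASE = "http://int.dpool.sina.com.cn/iplookup/iplookup.php"


def build_lookup_url(ip, fmt="json"):
    """Return the lookup URL for one IPv4 address."""
    # A list of pairs keeps the parameter order stable.
    query = urlencode([("format", fmt), ("ip", ip)])
    return DATA_BASE + "?" + query


if __name__ == "__main__":
    print(build_lookup_url("123.124.2.85"))
```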
The returned data is:
{"ret": 1, "start": "123.123.221.0", "end": "123.124.158.29", "country": "\u4e2d\u56fd", "province": "\u5317\u4eac", "city": "\u5317\u4eac", "district": "", "isp": "\u8054\u901a", "type": "", "desc": ""}
The returned content already contains the province, city, and district information, which is exactly what we want.
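To make the response easier to read, here is a minimal Python 3 sketch that decodes it. It assumes the keys are lowercase, as the crawl code later in this article reads them, and that ret == 1 marks a successful lookup. The function name parse_location is made up for this example.

```python
import json

# Sample response from the article, written with lowercase keys (assumption).
SAMPLE = (
    '{"ret": 1, "start": "123.123.221.0", "end": "123.124.158.29", '
    '"country": "\\u4e2d\\u56fd", "province": "\\u5317\\u4eac", '
    '"city": "\\u5317\\u4eac", "district": "", "isp": "\\u8054\\u901a", '
    '"type": "", "desc": ""}'
)


def parse_location(raw):
    """Decode one JSON response into a (province, city, district) tuple."""
    info = json.loads(raw)  # \uXXXX escapes become Chinese characters here
    if info.get("ret") != 1:
        return None  # assumption: anything other than 1 is a failed lookup
    return info["province"], info["city"], info["district"]


if __name__ == "__main__":
    print(parse_location(SAMPLE))
```

For the sample above, both province and city decode to the characters for Beijing, and the district is empty.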
Next, how to crawl this IP data. The main work is enumeration: the IP in the interface URL is replaced again and again. Enumerating every IP address is certainly not feasible, so we narrow the scope and enumerate only China's IP segments. Because Sina's interface returns the whole IP segment that an address belongs to, far fewer addresses actually need to be queried. On top of that, the 256 addresses that share the same first three octets are almost always in the same area, so querying one address per block of 256 cuts the data down much further. The key step in the enumeration is converting IP addresses to integers.
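The integer conversion can be sketched as follows in Python 3. Only the first three octets are packed, so one integer stands for one block of 256 addresses. The function names here are made up for this example and are not the ones used in the article's scripts.

```python
def prefix_to_int(ip):
    """Pack the first three octets of a dotted IPv4 string into one integer."""
    a, b, c = (int(part) for part in ip.split(".")[:3])
    return (a << 16) | (b << 8) | c


def int_to_prefix(n):
    """Inverse of prefix_to_int: turn the integer back into 'a.b.c'."""
    return "%d.%d.%d" % ((n >> 16) & 0xFF, (n >> 8) & 0xFF, n & 0xFF)


def blocks_between(start_ip, end_ip):
    """List the first address of every 256-address block from start_ip to end_ip, inclusive."""
    first, last = prefix_to_int(start_ip), prefix_to_int(end_ip)
    return [int_to_prefix(n) + ".0" for n in range(first, last + 1)]


if __name__ == "__main__":
    print(prefix_to_int("123.124.2.85"))
    print(blocks_between("1.0.1.0", "1.0.3.255"))
```

Working with integers means a plain range() walks across octet boundaries without any special handling.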
The specific IP address segments allocated to China can be found on the APNIC official website, or in the following file:
http://ftp.apnic.net/apnic/dbase/data/country-ipv4.lst
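A rough Python 3 sketch of picking China's segments out of such a list follows. The line layout used here ("start - end : prefix : country code : country name") is an assumption for illustration and should be checked against the real file. The sample lines and the function name china_ranges are made up.

```python
import re

# Matches a dotted IPv4 address.
IP_RE = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")


def china_ranges(lines, country_code="cn"):
    """Yield (start_ip, end_ip) for lines whose colon-separated fields include the country code."""
    for line in lines:
        fields = [field.strip().lower() for field in line.split(":")]
        if country_code not in fields:
            continue  # comments and other countries are skipped
        ips = IP_RE.findall(fields[0])
        if len(ips) >= 2:
            yield ips[0], ips[1]


if __name__ == "__main__":
    # Hypothetical sample lines in the assumed layout.
    sample = [
        "# comment line",
        "1.0.1.0 - 1.0.1.255 : 1.0.1/24 : cn : China",
        "1.0.16.0 - 1.0.31.255 : 1.0.16/20 : jp : Japan",
    ]
    print(list(china_ranges(sample)))
```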
Here is how the enumeration program is written:
import re

def ipv3_to_int(s):
    # Pack the first three octets "a.b.c" into one integer.
    l = [int(i) for i in s.split('.')]
    return (l[0] << 16) | (l[1] << 8) | l[2]

def int_to_ipv3(s):
    ip1 = s >> 16 & 0xFF
    ip2 = s >> 8 & 0xFF
    ip3 = s & 0xFF
    return "%d.%d.%d" % (ip1, ip2, ip3)

# Capture the first three octets of every IP address on a line.
pattern = re.compile(r'(\d{1,3}\.\d{1,3}\.\d{1,3})\.\d{1,3}')

with open('ChinaIPAddress.csv', 'r') as i, open('ChinaIPAddress.txt', 'a') as o:
    for iplist in i:
        ips = pattern.findall(iplist)
        if len(ips) < 2:
            continue  # skip lines without a start and an end address
        x, y = ips[0], ips[1]
        # + 1 so that the block containing the end address is included
        for ip in range(ipv3_to_int(x), ipv3_to_int(y) + 1):
            # ip_address = int_to_ipv3(ip)
            o.write(str(ip) + '\n')
Once the enumeration above has finished, the Sina IP interface can be crawled. The crawl code is as follows:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib, urllib2, simplejson, sqlite3, time

DATA_BASE = "http://int.dpool.sina.com.cn/iplookup/iplookup.php"

def ipv3_to_int(s):
    # Only the first three octets are used, so "a.b.c.d" works too.
    l = [int(i) for i in s.split('.')]
    return (l[0] << 16) | (l[1] << 8) | l[2]

def int_to_ipv4(s):
    ip1 = s >> 16 & 0xFF
    ip2 = s >> 8 & 0xFF
    ip3 = s & 0xFF
    return "%d.%d.%d.0" % (ip1, ip2, ip3)

def fetch(ipv4, **kwargs):
    kwargs.update({'ip': ipv4, 'format': 'json'})
    url = DATA_BASE + '?' + urllib.urlencode(kwargs)
    print(url)
    fails = 0
    while True:
        try:
            return simplejson.load(urllib2.urlopen(url, timeout=20))
        except (urllib2.URLError, IOError):
            fails += 1
            if fails >= 10:
                # After 10 failures in a row, pause for 10 minutes.
                time.sleep(60 * 10)
                fails = 0

def dbcreate():
    c = conn.cursor()
    c.execute('''create table if not exists ipdata(
        ip integer primary key, ret integer, start text, end text,
        country text, province text, city text, district text,
        isp text, type text, desc text)''')
    conn.commit()
    c.close()

def dbinsert(ip, address):
    c = conn.cursor()
    c.execute('insert into ipdata values(?,?,?,?,?,?,?,?,?,?,?)',
              (ip, address['ret'], address['start'], address['end'],
               address['country'], address['province'], address['city'],
               address['district'], address['isp'], address['type'],
               address['desc']))
    conn.commit()
    c.close()

conn = sqlite3.connect('ipaddress.sqlite3.db')
dbcreate()
with open('ChinaIPAddress.txt', 'r') as i:
    lines = [s.strip() for s in i if s.strip()]
end = 0
for ip in lines:
    ip = int(ip)
    if ip > end:  # skip blocks already covered by the last returned segment
        ipaddress = int_to_ipv4(ip)
        info = fetch(ipaddress)
        if info['ret'] == 1:  # store successful lookups only
            dbinsert(ip, info)
            end = ipv3_to_int(info['end'])
            print("%d %d" % (ip, end))
conn.close()
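The skip-ahead idea in the crawl loop (do not query a block that the last returned segment already covers) can be tried without any network access. The Python 3 sketch below uses a stand-in lookup function in place of the HTTP call. fake_lookup and its four-block segments are invented purely for illustration.

```python
def prefix_to_int(ip):
    """Pack the first three octets of a dotted IPv4 string into one integer."""
    a, b, c = (int(part) for part in ip.split(".")[:3])
    return (a << 16) | (b << 8) | c


def crawl(prefixes, lookup):
    """Query only the prefixes not covered by a previously returned segment."""
    covered_up_to = -1
    rows = []
    for prefix in sorted(prefixes):
        if prefix <= covered_up_to:
            continue  # inside the last returned segment, no request needed
        info = lookup(prefix)
        if info is None:
            continue
        rows.append((prefix, info))
        covered_up_to = max(covered_up_to, prefix_to_int(info["end"]))
    return rows


def fake_lookup(prefix):
    """Hypothetical stand-in for the HTTP call: every segment spans four blocks."""
    last = prefix - prefix % 4 + 3
    end = "%d.%d.%d.255" % ((last >> 16) & 0xFF, (last >> 8) & 0xFF, last & 0xFF)
    return {"end": end}


if __name__ == "__main__":
    blocks = range(prefix_to_int("1.0.0.0"), prefix_to_int("1.0.7.255") + 1)
    rows = crawl(blocks, fake_lookup)
    print(len(rows), "requests for", len(blocks), "blocks")
```

With the invented four-block segments, eight blocks need only two requests, which is the saving the returned "end" field makes possible.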
This captures all of Sina's IP data for China, which is very useful in data analysis projects.
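As one way to use the collected data, the Python 3 sketch below looks up the region for a visitor's IP in a table with the same columns as the ipdata table created by the crawl script. The function names and the in-memory demo row are assumptions for illustration. The row reuses the sample response shown earlier.

```python
import sqlite3


def prefix_to_int(ip):
    """Pack the first three octets of a dotted IPv4 string into one integer."""
    a, b, c = (int(part) for part in ip.split(".")[:3])
    return (a << 16) | (b << 8) | c


def find_region(conn, ip):
    """Return (province, city, district) for the segment containing ip, or None."""
    prefix = prefix_to_int(ip)
    # Nearest stored segment that starts at or before this block.
    row = conn.execute(
        'SELECT "end", province, city, district FROM ipdata '
        'WHERE ip <= ? ORDER BY ip DESC LIMIT 1', (prefix,)).fetchone()
    if row is None or prefix_to_int(row[0]) < prefix:
        return None  # no segment, or the segment ends before this block
    return row[1:]


def make_demo_db():
    """Build an in-memory table with one row taken from the sample response."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE ipdata (ip integer primary key, ret integer, start text, '
        '"end" text, country text, province text, city text, district text, '
        'isp text, type text, "desc" text)')
    conn.execute(
        "INSERT INTO ipdata VALUES (?,?,?,?,?,?,?,?,?,?,?)",
        (prefix_to_int("123.123.221.0"), 1, "123.123.221.0", "123.124.158.29",
         "\u4e2d\u56fd", "\u5317\u4eac", "\u5317\u4eac", "", "\u8054\u901a", "", ""))
    conn.commit()
    return conn


if __name__ == "__main__":
    print(find_region(make_demo_db(), "123.124.2.85"))
```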