This article shows how to use a Python program to capture Sina's IP address data for all of China, as a small exercise in working with IP addresses in Python network programming. In data analysis, and in website analytics in particular, visitors' IP addresses need to be analyzed, mainly to determine each visitor's province, city, and administrative district. Because the pure IP database does not distinguish these levels well, I looked for another workable solution (without paying for one, of course). The solution is to capture Sina's IP data.
Sina's IP data interface is:
http://int.dpool.sina.com.cn/iplookup/iplookup.php?format=json&ip=123.124.2.85
The returned data is:
{"ret":1,"start":"123.123.221.0","end":"123.124.158.29","country":"\u4e2d\u56fd","province":"\u5317\u4eac","city":"\u5317\u4eac","district":"","isp":"\u8054\u901a","type":"","desc":""}
The returned content includes province, city, and administrative district information, which is exactly what we want.
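The \uXXXX sequences in the response are ordinary JSON Unicode escapes, so any JSON parser decodes them. A minimal sketch in Python 3 using the standard json module (the article's own scripts target Python 2 and simplejson). The sample string is the response quoted above, and parse_region is a hypothetical helper name, not part of any Sina API:

```python
import json

# Sample response copied from the article; the live service may differ.
SAMPLE = ('{"ret":1,"start":"123.123.221.0","end":"123.124.158.29",'
          '"country":"\\u4e2d\\u56fd","province":"\\u5317\\u4eac",'
          '"city":"\\u5317\\u4eac","district":"","isp":"\\u8054\\u901a",'
          '"type":"","desc":""}')


def parse_region(text):
    """Hypothetical helper: return (country, province, city, district)."""
    data = json.loads(text)  # decodes the \uXXXX escapes to Unicode text
    return (data["country"], data["province"], data["city"], data["district"])


if __name__ == "__main__":
    print(parse_region(SAMPLE))
```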
Next, let's look at how to capture this IP data. The main task is enumeration: the IP address in the interface URL is replaced again and again. Trying every possible IP address is clearly impractical, so we narrow the scope and enumerate only the IP ranges allocated to China. Because Sina's interface returns a whole IP range (a start and an end address) for each query, the amount of work is much smaller than it first appears. In addition, the 256 addresses that differ only in the last octet are almost always in the same region, so the last octet can be dropped, which discards a great deal of data. The most important step is to convert each IP address to an integer.
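The enumeration idea can be sketched without any network access: identify each block of 256 addresses by an integer built from its first three octets, query the first address of a block, then jump past the end of the range that comes back. This is only an illustration in Python 3; block_key, block_start and enumerate_ranges are hypothetical names, and fake_lookup is a stand-in for the Sina interface with invented ranges:

```python
def block_key(ip):
    """Pack the first three octets of a dotted IPv4 string into an int."""
    a, b, c = (int(part) for part in ip.split(".")[:3])
    return (a << 16) | (b << 8) | c


def block_start(key):
    """First address of the 256-address block identified by key."""
    return "%d.%d.%d.0" % (key >> 16 & 0xFF, key >> 8 & 0xFF, key & 0xFF)


def enumerate_ranges(first_key, last_key, lookup):
    """Yield one lookup result per returned range instead of one per block."""
    key = first_key
    while key <= last_key:
        info = lookup(block_start(key))
        yield info
        # Jump past the block holding the range's end address; always
        # advance by at least one block so the loop terminates.
        key = max(key, block_key(info["end"])) + 1


def fake_lookup(ip):
    # Stand-in for the remote interface; these ranges are invented.
    if block_key(ip) <= block_key("10.0.2.0"):
        return {"start": "10.0.0.0", "end": "10.0.2.255"}
    return {"start": "10.0.3.0", "end": "10.0.4.255"}


if __name__ == "__main__":
    first, last = block_key("10.0.0.0"), block_key("10.0.4.0")
    for info in enumerate_ranges(first, last, fake_lookup):
        print(info["start"], "-", info["end"])
```

With the invented ranges, five blocks are covered by two lookups, which is the saving the paragraph above describes.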
For the IP address ranges allocated to China, see the official APNIC website or the following file:
http://ftp.apnic.net/apnic/dbase/data/country-ipv4.lst
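The article does not show how the China ranges are extracted from that file. The following Python 3 sketch assumes, without having verified it against the live file, that each data line has the layout "start - end : CIDR list : country code : country name" and that comment lines begin with "#". The sample lines are illustrative only, and parse_cn_ranges is a hypothetical helper:

```python
import re

# Matches dotted-quad addresses; the first two on a line are start and end.
ADDR = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")


def parse_cn_ranges(lines, country="cn"):
    """Return (start, end) pairs for lines tagged with the country code.

    Assumed layout (not verified against the live file):
        start - end : cidr list : country code : country name
    """
    ranges = []
    for line in lines:
        if line.startswith("#"):
            continue  # assumed comment marker
        fields = [f.strip() for f in line.split(":")]
        addrs = ADDR.findall(fields[0])
        if len(fields) >= 3 and len(addrs) == 2 and fields[2].lower() == country:
            ranges.append((addrs[0], addrs[1]))
    return ranges


SAMPLE = [
    "# illustrative sample lines in the assumed layout",
    "1.0.0.0 - 1.0.0.255 : 1.0.0.0/24 : au : Australia",
    "1.0.1.0 - 1.0.3.255 : 1.0.1.0/24,1.0.2.0/23 : cn : China",
]

if __name__ == "__main__":
    # One "start,end" pair per line, suitable for a file like ChinaIPAddress.csv.
    for start, end in parse_cn_ranges(SAMPLE):
        print(start + "," + end)
```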
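Let's take a look at how to write the enumeration program:

import re

def ipv3_to_int(s):
    # Pack the first three octets of a dotted address into one integer.
    l = [int(i) for i in s.split('.')]
    return (l[0] << 16) | (l[1] << 8) | l[2]

def int_to_ipv3(s):
    # Inverse of ipv3_to_int: unpack an integer into "a.b.c".
    ip1 = s >> 16 & 0xFF
    ip2 = s >> 8 & 0xFF
    ip3 = s & 0xFF
    return "%d.%d.%d" % (ip1, ip2, ip3)

# Capture the first three octets of every dotted-quad address on a line.
pattern = re.compile(r'(\d{1,3}\.\d{1,3}\.\d{1,3})\.\d{1,3}')

i = open('ChinaIPAddress.csv', 'r')
o = open('ChinaIPAddress.txt', 'a')
for line in i:
    ips = pattern.findall(line)
    if len(ips) < 2:
        continue  # skip lines without both a start and an end address
    x = ips[0]
    y = ips[1]
    # Write one integer per 256-address block. The +1 includes the block
    # that holds the end address, since range() excludes its upper bound.
    for ip in range(ipv3_to_int(x), ipv3_to_int(y) + 1):
        o.write(str(ip) + '\n')
o.close()
i.close()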
Once the preceding step is complete, you can crawl the Sina IP interface. The capture code is as follows:

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Python 2 script (urllib2, print statements).
import urllib, urllib2, simplejson, sqlite3, time

DATA_BASE = "http://int.dpool.sina.com.cn/iplookup/iplookup.php"

def ipv3_to_int(s):
    # Pack the first three octets of a dotted address into one integer.
    l = [int(i) for i in s.split('.')]
    return (l[0] << 16) | (l[1] << 8) | l[2]

def int_to_ipv4(s):
    # Expand a three-octet integer into the first address of its block.
    ip1 = s >> 16 & 0xFF
    ip2 = s >> 8 & 0xFF
    ip3 = s & 0xFF
    return "%d.%d.%d.0" % (ip1, ip2, ip3)

def fetch(ipv4, **kwargs):
    kwargs.update({
        'ip': ipv4,
        'format': 'json',
    })
    url = DATA_BASE + '?' + urllib.urlencode(kwargs)
    print url
    # Retry in a loop: the original reset the failure counter on every
    # recursive call, so the 10-minute pause could never be reached.
    fails = 0
    while True:
        try:
            return simplejson.load(urllib2.urlopen(url, timeout=20))
        except (urllib2.URLError, IOError):
            fails += 1
            if fails >= 10:
                # After 10 consecutive failures, wait 10 minutes.
                sleep_download_time = 60 * 10
                time.sleep(sleep_download_time)
                fails = 0

def dbcreate():
    c = conn.cursor()
    c.execute('''create table ipdata(
        ip integer primary key,
        ret integer,
        start text,
        end text,
        country text,
        province text,
        city text,
        district text,
        isp text,
        type text,
        desc text
    )''')
    conn.commit()
    c.close()

def dbinsert(ip, address):
    c = conn.cursor()
    c.execute('insert into ipdata values(?,?,?,?,?,?,?,?,?,?,?)',
              (ip, address['ret'], address['start'], address['end'],
               address['country'], address['province'], address['city'],
               address['district'], address['isp'], address['type'],
               address['desc']))
    conn.commit()
    c.close()

conn = sqlite3.connect('ipaddress.sqlite3.db')
dbcreate()

i = open('ChinaIPAddress.txt', 'r')
ips = [s.strip() for s in i.readlines()]
end = 0
for ip in ips:
    ip = int(ip)
    # Skip blocks already covered by the range returned for an earlier query.
    if ip > end:
        ipaddress = int_to_ipv4(ip)
        info = fetch(ipaddress)
        if info['ret'] != -1:
            dbinsert(ip, info)
            end = ipv3_to_int(info['end'])
            print ip, end
i.close()
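The capture script targets Python 2. For readers on Python 3, the request-and-retry step might look roughly like the sketch below, which uses only the standard library. The endpoint is the one given in the article, and nothing here assumes it still responds. The opener, retries, pause and sleep parameters and the build_url helper are additions for illustration, not part of the original script:

```python
import json
import time
import urllib.parse
import urllib.request

BASE_URL = "http://int.dpool.sina.com.cn/iplookup/iplookup.php"


def build_url(ipv4, **params):
    """Build the lookup URL with format=json and the given address."""
    params.update({"ip": ipv4, "format": "json"})
    return BASE_URL + "?" + urllib.parse.urlencode(sorted(params.items()))


def fetch(ipv4, opener=urllib.request.urlopen, retries=10, pause=600,
          sleep=time.sleep):
    """Return the decoded JSON reply, pausing after `retries` failures."""
    url = build_url(ipv4)
    fails = 0
    while True:
        try:
            with opener(url, timeout=20) as resp:
                return json.loads(resp.read().decode("utf-8"))
        except OSError:  # urllib.error.URLError is a subclass of OSError
            fails += 1
            if fails >= retries:
                sleep(pause)  # back off, then start counting again
                fails = 0
```

Passing the opener and sleep functions as parameters keeps the retry logic testable without touching the network; a call such as fetch("123.124.2.0") would use the real urlopen.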
At this point, all of Sina's IP address data for China has been captured and can be used in a data analysis project.
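As one possible way to use the captured table in analysis, a visitor's address can be reduced to its three-octet integer and matched against the stored ranges. The sketch below is in Python 3 and runs against an in-memory database with a reduced version of the ipdata table. The inserted row is illustrative (region names written in English for readability), and lookup_region and block_key are hypothetical helpers:

```python
import sqlite3


def block_key(ip):
    """Pack the first three octets of a dotted IPv4 string into an int."""
    a, b, c = (int(part) for part in ip.split(".")[:3])
    return (a << 16) | (b << 8) | c


def lookup_region(conn, visitor_ip):
    """Return (province, city) for visitor_ip, or None if no range covers it."""
    key = block_key(visitor_ip)
    # Nearest stored row at or below the visitor's block.
    row = conn.execute(
        'SELECT "end", province, city FROM ipdata '
        "WHERE ip <= ? ORDER BY ip DESC LIMIT 1",
        (key,),
    ).fetchone()
    if row is None or key > block_key(row[0]):
        return None
    return (row[1], row[2])


if __name__ == "__main__":
    # The article's database file is ipaddress.sqlite3.db; an in-memory
    # database with a reduced table and one illustrative row is used here.
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE ipdata (ip INTEGER PRIMARY KEY, "end" TEXT, '
                 "province TEXT, city TEXT)")
    conn.execute("INSERT INTO ipdata VALUES (?, ?, ?, ?)",
                 (block_key("123.123.221.0"), "123.124.158.29",
                  "Beijing", "Beijing"))
    print(lookup_region(conn, "123.124.2.85"))
    print(lookup_region(conn, "8.8.8.8"))
```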