Parsing the Chunzhen ("Pure") IP address database

Last Update:2016-04-13 Source: Internet

Author: User

Tags geoip ip number


I spent the past week parsing an IP address database. From research to coding to optimization, it took about seven or eight days, and it was fun. This post summarizes the whole process.

1. How to resolve an IP address to a location

There are two main ways to resolve an IP address today: through an API, or through a local IP database.

The API approach is very simple. Many Chinese vendors, including BAT (Baidu, Alibaba, Tencent), offer IP lookup interfaces: send a request with the IP and you get back the corresponding geographic location. The advantage is that the coding is simple: one request fetches the data, you parse it (usually just JSON), and there is no database to maintain and no burden on the local machine. The shortcomings are just as obvious. First, it is slow, because every lookup is a network request. Second, there are limits; Baidu, for example, caps usage at 250 requests per second to keep heavy concurrency from congesting the network. Third, you are at the provider's mercy: the address may change today, the data format tomorrow, and the day after they may start charging. It is their product, after all.

The IP database approach is more involved: you need a complete database and have to build a query service on top of it. Its pros and cons are the exact opposite of the API's. Queries are fast and not constrained by the network or a third-party website, but the coding is more complex and the database has to be maintained. The best-known databases in China are Chunzhen (the "Pure" IP database) and IPIP; abroad, the better-known ones include GeoIP.

After weighing the pros and cons, we decided on the database approach. I have heard that GeoIP's data for foreign IPs is very complete but that its coverage of Chinese IPs is less so. We therefore started with the Chunzhen IP database.

2. Storage
I won't describe how to download the Chunzhen database. I also didn't bother to parse the .dat file; I decompressed it straight to TXT and worked from that. It holds just under 450,000 records.

The first piece of common knowledge is that an IP address is really an unsigned int value. When I asked around in the group about how people handle this, I found that many didn't even know it. The IP address we see is four numbers between 0 and 255, but inside the computer it is represented as 32 binary digits, like this: 01010101.10101010.00110011.11001100. Thirty-two bits is, of course, exactly the range of an unsigned int. The same holds for IP resolution: converting the IP to an int for storage and lookup is the most space-efficient and fastest method.
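The conversion described above can be sketched in a few lines of Python. This is only an illustration using the standard `ipaddress` module; the function names `ip_to_int` and `int_to_ip` are made up for this example and are not from the original code.

```python
import ipaddress


def ip_to_int(ip):
    """Convert a dotted-quad IPv4 string to its unsigned 32-bit integer value."""
    return int(ipaddress.IPv4Address(ip))


def int_to_ip(value):
    """Convert an unsigned 32-bit integer back to dotted-quad notation."""
    return str(ipaddress.IPv4Address(value))


print(ip_to_int("1.2.3.4"))      # 16909060
print(int_to_ip(3739147998))     # 222.222.222.222
```

The first value is simply 1*2^24 + 2*2^16 + 3*2^8 + 4.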

Back to the main topic. The extracted IP address database looks like this:

(The Chunzhen IP database can be downloaded from http://www.cz88.net. It is public, though I'm sure some people won't know what I'm talking about.)

Each record has three fields: start, end, and address. Every IP from start to end belongs to the location address. On closer inspection, however, end turns out to be useless: each end is immediately followed by the next start, with no gaps in between. So I only need to record start:address, which gives us key-value pairs. That opens up more tools, most typically a database such as Redis, or simply a dictionary.

So how do we do a lookup? The simplest way is to extract the keys, sort them, and run a binary search. With 450,000 records, that takes at most 19 comparisons. Note that what we are searching for is the largest key that is less than or equal to the given IP.

What? You suggest mapping every single IP to an address and writing all of them into one list? Well... it is not impossible, but first your server needs 200 GB of memory. Yes, 200 GB. Of memory.
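Reducing each record to a start:address pair might look like the sketch below. The line layout (start IP, end IP, then free-form location text, separated by whitespace) is an assumption about the TXT export, and the sample rows are invented placeholders rather than real database content.

```python
import ipaddress

# Invented sample rows, NOT real database content. Assumed layout:
# start IP, end IP, then free-form location text, separated by whitespace.
SAMPLE_LINES = [
    "0.0.0.0         0.255.255.255   Location A",
    "1.0.0.0         1.0.0.255       Location B",
    "1.0.1.0         1.0.3.255       Location C",
]


def parse_line(line):
    """Return (start as int, address); the end field is read and discarded."""
    start, _end, address = line.split(None, 2)
    return int(ipaddress.IPv4Address(start)), address.strip()


# Key-value pairs: integer start of each range -> location text.
ip2addr = dict(parse_line(line) for line in SAMPLE_LINES)
print(ip2addr)
```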
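A minimal sketch of that lookup, using Python's standard `bisect` module. The `starts` and `addrs` lists hold hypothetical values chosen for illustration; in practice they would come from the parsed database.

```python
import bisect
import math

# Hypothetical sorted start values (as integers) and their addresses.
starts = [0, 16777216, 16777472, 16778240]
addrs = ["Location A", "Location B", "Location C", "Location D"]


def lookup(ip_value):
    """Return the address whose start is the largest one <= ip_value."""
    # bisect_right returns the insertion point after any equal key, so the
    # element just before it is the largest start <= ip_value.
    index = bisect.bisect_right(starts, ip_value) - 1
    if index < 0:
        raise KeyError(ip_value)
    return addrs[index]


print(lookup(16777300))                   # Location B
print(math.ceil(math.log2(450_000)))      # 19
```

The second line of output is where the figure of 19 comparisons comes from: it is the base-2 logarithm of 450,000, rounded up.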
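A rough back-of-the-envelope check shows how a figure of that size can arise. The article does not say how the 200G was derived; the 50 bytes per entry used here is purely an assumption picked to show the order of magnitude.

```python
TOTAL_IPV4 = 2 ** 32      # 4,294,967,296 possible addresses
BYTES_PER_ENTRY = 50      # assumption: average size of one stored address string

total_bytes = TOTAL_IPV4 * BYTES_PER_ENTRY
total_gib = total_bytes / 2 ** 30
print(total_gib)          # 200.0
```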

3. Algorithm Evolution

3.1. First consider Redis

I need the program to stay running. In other words, I need a server that holds the prepared address structure, so that resolving an IP only takes sending it a request. If I keep the dictionary in memory, the program has to run all the time and I have to write the server myself; whether it speaks TCP, UDP, or HTTP does not matter. However, I am lazy, so I considered Redis first. After all, someone has already written the storage structure, which saves me the thinking.

But it turns out that I was wrong.

3.1.1. Plain key-value pairs

I started with the simplest method: SET ip addr for every record. At query time I read all the keys, converted their types, sorted them, and ran a binary search for the largest value less than or equal to the given IP. One full pass took 3.2 seconds. Damn, that is slower than calling the API directly! Thinking about it, I was being really dumb: converting and re-sorting more than 400,000 numbers on every query was never going to be fast.

3.1.2. Sorted set

The next scheme had two requirements: store integers, and keep them ordered. Storing integers turned out to be impossible; from what I found online, none of the Redis data types are numeric, only strings and various collection types. Someone recommended the sorted set, and ZADD ip2addr ip addr loaded the data without trouble. However, queries kept returning wrong results. After being baffled for a long time, I finally found the reason: if ip1 maps to addr and, a little later, ip2 also maps to addr, then ip2 overwrites ip1. That makes no sense! Sadly, I had to abandon the sorted set.

(In fact, as some experts pointed out, it can still be used: since identical addr values overwrite each other, make them artificially different, for example by storing them in a form such as [email protected]. I was in a hurry and didn't give it much thought.)
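The overwrite happens because sorted-set members are unique: adding an existing member again only updates its score. The sketch below imitates that behaviour with a plain Python dictionary instead of a live Redis server. The `zadd` helper and the `score:addr` member format are hypothetical illustrations, not necessarily the form the experts had in mind.

```python
def zadd(sorted_set, score, member):
    """Mimic ZADD: members are unique, so re-adding a member replaces its score."""
    sorted_set[member] = score


# Hypothetical data: two different start IPs (scores) share one address.
plain = {}
zadd(plain, 16777216, "Location B")
zadd(plain, 16778240, "Location B")    # replaces the score stored above
print(len(plain))                      # 1

# Workaround: make every member unique by embedding the start IP in it.
unique = {}
for score, addr in [(16777216, "Location B"), (16778240, "Location B")]:
    zadd(unique, score, f"{score}:{addr}")
print(len(unique))                     # 2
```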

3.1.3. List

To keep the IPs ordered, a list is still the best fit. So I built two lists in Redis: one of IPs, the other of the addr for the corresponding position. At query time I fetch the IP list, find the index for the given IP, and use that index to read the addr from the second list. The idea is nice; reality is cruel. Because Redis lists are doubly linked lists, fetching all 400,000-plus elements is slow, and a lookup takes about 360 ms. There is also a seemingly strange effect: for a small IP value such as 1.2.3.4 the query takes 4 ms, while for one like 222.222.222.222 it takes close to 400 ms.

That is still unbearably slow.

3.1.4. List + blocks

The list approach takes 300-plus ms, which is still too slow. I skimmed through Redis and found no more suitable data structure, so the only option left was to optimize at the algorithm level. Looking at the structure of the IP addresses, the first 220,000 records should cover roughly the lower half of the IPs and the remaining records the upper half. I tried fetching only the relevant half of the IP list for each query, and sure enough the time also halved, to about 170 ms. So, can the position of an IP be pinpointed even more precisely?

Picture 4.2 billion IPs spread over 440,000 records: how many records fall into each block of the address space? They are certainly not evenly distributed, but they can be counted. I divided the int range into blocks of 10^7, so the 4.2-billion-plus values are cut into 430 blocks (an IP value below 10^7 goes into block 0, one from 10^7 up to 2*10^7 into block 1, and so on), and counted the number of IPs in each block. The next step is to compute the cumulative totals: how many IPs are in the first block, in the first two blocks, in the first three blocks, and so on. For example, if the list of per-block counts is [a1, a2, a3, a4, ...], the cumulative list is [a1, a1+a2, a1+a2+a3, a1+a2+a3+a4, ...]. This cumulative list is the IP index, and it allows an IP to be located precisely. At query time, first compute which block the IP belongs to, then look up the corresponding index entries, and finally fetch the matching range of the IP list by index. Although this adds one extra query, it greatly reduces the amount of data pulled from Redis. In testing, the time came down to about 65 ms.

However, this algorithm has two problems. The first is the block size, which has to be set by hand. It determines both the number of IPs per block and the number of blocks, that is, the size of the index list. Choosing it is all experience, with no theory behind it. The other problem is blocks that contain zero IPs. Continuing the example above, with a count list of [a1, 0, 0, 0, a3, a4, ...] the index list is [a1, a1, a1, a1, a1+a3, a1+a3+a4, ...]: one IP lies within the a1 range and the very next one is already in the a3 range. If I now query an IP that falls into one of the empty blocks, the range to search should be [a1, a1+a3], but the index gives [a1, a1], which inevitably produces a wrong result. I don't have a very good solution. The only thing I can think of is to also keep the table of per-block counts, check whether the queried IP's block is empty, and if so walk back to the first non-empty block before it. Performance would definitely suffer.
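The block index described above can be sketched as follows. This is an illustration under assumptions, not the original code: the function names and the four sample start values are made up, and the sorted list lives in local memory here rather than in Redis.

```python
from itertools import accumulate

BLOCK_SIZE = 10 ** 7
NUM_BLOCKS = 2 ** 32 // BLOCK_SIZE + 1    # 430 blocks cover the 32-bit range


def build_index(starts):
    """Count the start values in each block, then return the cumulative totals."""
    counts = [0] * NUM_BLOCKS
    for value in starts:
        counts[value // BLOCK_SIZE] += 1
    return list(accumulate(counts))


def block_range(index, ip_value):
    """Return (lo, hi): positions lo..hi-1 of the sorted list hold the IP's block."""
    block = ip_value // BLOCK_SIZE
    lo = index[block - 1] if block > 0 else 0
    hi = index[block]
    return lo, hi


# Hypothetical sorted start values: two in block 0, one in block 1, one in block 3.
starts = [0, 5_000_000, 12_000_000, 35_000_000]
index = build_index(starts)
print(NUM_BLOCKS)                        # 430
print(index[:4])                         # [2, 3, 3, 4]
print(block_range(index, 12_345_678))    # (2, 3)
```

Only the positions between `lo` and `hi` need to be fetched from the stored list, which is what cuts down the amount of data transferred per query.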
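One possible way around the empty-block case, offered as a sketch rather than as the author's fix: since the target is the largest start less than or equal to the IP, a search that finds nothing inside the block can fall back to the position just before the block's slice, which is the last record of the nearest earlier non-empty block. The names and sample values below are hypothetical.

```python
import bisect
from itertools import accumulate

BLOCK_SIZE = 10 ** 7
NUM_BLOCKS = 2 ** 32 // BLOCK_SIZE + 1


def build_index(starts):
    """Cumulative number of start values up to and including each block."""
    counts = [0] * NUM_BLOCKS
    for value in starts:
        counts[value // BLOCK_SIZE] += 1
    return list(accumulate(counts))


def lookup(starts, addrs, index, ip_value):
    """Find the record with the largest start <= ip_value, searching one block."""
    block = ip_value // BLOCK_SIZE
    lo = index[block - 1] if block > 0 else 0
    hi = index[block]
    # Search only this block's slice; it is empty when the block has no starts.
    pos = bisect.bisect_right(starts, ip_value, lo, hi) - 1
    # If nothing in the block is <= ip_value, pos == lo - 1: the last record
    # of the nearest earlier non-empty block, whose range covers this IP.
    if pos < 0:
        raise KeyError(ip_value)
    return addrs[pos]


# Hypothetical data: block 2 (20,000,000 to 29,999,999) contains no start value.
starts = [0, 5_000_000, 12_000_000, 35_000_000]
addrs = ["A", "B", "C", "D"]
index = build_index(starts)
print(lookup(starts, addrs, index, 25_000_000))   # C
```

With the list stored remotely, this would mean fetching one extra element before the block's slice, instead of walking backwards through the count table.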

3.2. Memory

3.2.1. Ordered dictionary

After running into this problem I asked the experts in the QQ group. The few who had done this had all written their own service. Sigh. I had wanted to be lazy, and after going in circles I ended up in a pit instead. So I wrote my own socket server. To keep the keys as integers and in order, I stored the data in an OrderedDict. Fetch the keys, binary search, look up, check the elapsed time, and... ouch, why is it 50 ms, about the same as before?

3.2.2. Dictionary + list

I went back to the experts, and the answer was to use a list. It dawned on me: a list in the form produced by dict.iteritems() keeps the key-value shape of the dictionary and is ordered as well. OrderedDict uses a doubly linked list internally, so however you look at it a list is faster. I made a simple change to the existing code and tested again: 1 ms. 1 ms?! I checked against the IP database, and the results appear to be correct.

Well, that's it; this problem is settled for now. Personally, I find the Redis list + block algorithm from the middle of the journey the most interesting, and it is a pity it could not be put to use. In the final solution the main bottleneck is the socket rather than reading the list: the lookup itself is on the order of 10^-5 s. A few small problems remain, but knowing the idea is good enough, since the best solution could not be used anyway.

As for the code, I'll upload it to GitHub once I've learned how.
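The list-of-pairs layout might look like the sketch below. The original code is not shown, so this is an assumption about its shape. It uses Python 3, where `dict.items()` takes the place of `dict.iteritems()`, and it sorts explicitly instead of relying on the dictionary's order. The sample mapping is hypothetical.

```python
import bisect

# Hypothetical mapping from start value (int) to address.
ip2addr = {16777472: "Location C", 0: "Location A", 16777216: "Location B"}

# Freeze the dictionary into a sorted list of (start, address) pairs,
# plus a parallel list of keys for bisect to search.
records = sorted(ip2addr.items())
keys = [start for start, _ in records]


def lookup(ip_value):
    """Return the (start, address) pair with the largest start <= ip_value."""
    index = bisect.bisect_right(keys, ip_value) - 1
    if index < 0:
        raise KeyError(ip_value)
    return records[index]


print(lookup(16777300))    # (16777216, 'Location B')
```

Both lists are built once at startup, so each query costs only the binary search and one index access.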


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
