Hadoop (Pig) Stats IP geolocation

Source: Internet
Author: User
Tags: ip, location, database, vcard

A very common analysis need: the logs contain visitor IPs (domestic), and each IP must be resolved to a geographical location, accurate to the city/county level. The average data volume is about 15 GB per day, and daily, weekly, and monthly statistics are required.

The final implementation relies on an IP location database that maps each IP segment to an address, e.g. 1.1.0.0–1.1.0.255 → Fujian Province. Each segment is then expanded into one row per individual IP (1.1.0.1 → Fujian Province), so that the actual address of every visitor IP can be obtained with a JOIN in Pig.

IP Database

The IP Geographic Information Standards Committee (IPB), an interactive-network branch of the China Advertising Association established in 2012, has published an IP geo-information standard library, with board members drawn from large advertising companies. You can also download the global city IP library and take only the CN section:

File: cn.csv
Size: 1.26 MB
Validity: permanent

Read this file and convert each IP to a number. The difference between the starting and ending IPs of a segment is the number of hosts it contains, and the segment is expanded with Python's range function. The main Python code is as follows:

```python
# coding: utf-8
# Expand each IP segment in cn.csv into one "<ip_as_number>,<city>" row.
from functools import reduce

def ip2long(ip_string):
    # Dotted-quad IP -> integer; non-IP fields (e.g. the header) pass through.
    if ip_string.find('.') == -1:
        return ip_string
    return reduce(lambda a, b: a << 8 | b, map(int, ip_string.split('.')))

if __name__ == '__main__':
    output = open(r'out.csv', 'a')
    source = open(r'cn.csv')
    for line in source.readlines():
        s, e, city = map(ip2long, line.split(','))
        if city.strip() == 'Area':  # skip the header row
            continue
        for i in range(s, e + 1):   # one output row per IP in the segment
            output.write('%s,%s' % (i, city))  # city keeps its trailing newline
    output.close()
    source.close()
```

The processing results are saved in out.csv. China has roughly 300 million IPs in total, so the resulting file is about 7.9 GB (around 900 MB with default gzip compression), and it needs to be split into segments to suit the map tasks.
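The splitting step itself is not shown in the post; a minimal sketch (file names, chunk size, and the helper name are my assumptions) that cuts the CSV into gzipped parts of a fixed number of lines could look like:

```python
# Hypothetical helper: split a large CSV into fixed-size gzipped parts so
# each part becomes a separate map input. Chunk size is an assumption.
import gzip

def split_csv_gzip(src_path, out_prefix, lines_per_part=1000000):
    part, count, out = 0, 0, None
    with open(src_path) as src:
        for line in src:
            if out is None or count >= lines_per_part:
                if out:
                    out.close()
                # e.g. out_000.csv.gz, out_001.csv.gz, ...
                out = gzip.open('%s_%03d.csv.gz' % (out_prefix, part), 'wt')
                part += 1
                count = 0
            out.write(line)
            count += 1
    if out:
        out.close()
    return part  # number of parts written
```

The parts can then be uploaded to HDFS and loaded with the same glob pattern the Pig script uses.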

Pig Latin

Load the log files (*.log.gz) and the IP database (ip*.tar.gz) stored in HDFS, convert the visitor IP in each log record into a number, JOIN against the database, and then GROUP and COUNT to get the number of visitors per region:

```pig
ip_map_data = LOAD '/user/ip*.tar.gz' USING PigStorage(',')
              AS (iplong:long, code:int);
data        = LOAD '/home/*.log.gz' USING PigStorage('"')
              AS (id:chararray, ip:chararray);
split_data  = FOREACH data {
                  ip = STRSPLIT(ip, '\\.', 4);
                  GENERATE id, (long)ip.$0 * 16777216 + (long)ip.$1 * 65536
                             + (long)ip.$2 * 256 + (long)ip.$3;
              };
join_data   = JOIN ip_map_data BY iplong, split_data BY $1;
group_data  = GROUP join_data BY $1 PARALLEL 6;  -- $1 = region code
count       = FOREACH group_data GENERATE group, COUNT(join_data);
DUMP count;
```
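As a quick sanity check on the multiplier expression used in the Pig script (this helper is illustrative, not part of the original job), the same conversion in Python:

```python
# The Pig expression a*16777216 + b*65536 + c*256 + d must produce the
# same integers as the shift-based ip2long used to build the database
# (a<<24 | b<<16 | c<<8 | d), or the JOIN will never match.
def ip2long(ip_string):
    a, b, c, d = map(int, ip_string.split('.'))
    return a * 16777216 + b * 65536 + c * 256 + d

print(ip2long('1.1.0.1'))  # 16842753, same as 1<<24 | 1<<16 | 1
```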

Counting one week of results took 44 minutes and 33 seconds:

(Beijing, 32174022)
(Baoding, 4244694)
(Handan, 1062551)
Job Name: PigLatin:ip.pig
Job ACLs: all users are allowed
Job Setup: successful
Status: succeeded
Finished in: 44mins, 33sec
Job Cleanup: successful

Afterword

Before this, I tried writing a Pig UDF that used an {ip: city} dictionary to match IPs to cities, but Java limits the size of such a function and a UDF hurts performance, so that approach was dropped. If a custom reduce supported something like SQL's JOIN ... BETWEEN, the file of more than 300 million records would not be needed at all. The above was run on a very small test Hadoop cluster; a real production environment should be able to process it faster.
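The interval-lookup alternative the afterword hints at can be sketched in Python with a binary search over segment start addresses, so each visitor IP is matched to its segment directly instead of expanding every IP into a row (the segment data below is illustrative):

```python
# Sketch of a "join between"-style lookup: keep segments sorted by start
# and binary-search each visitor IP into its segment. The sample segments
# are illustrative, not taken from the real cn.csv.
import bisect

segments = [
    (16842752, 16843007, 'Fujian'),   # 1.1.0.0 - 1.1.0.255
    (16843008, 16843263, 'Beijing'),  # 1.1.1.0 - 1.1.1.255
]
starts = [s for s, _, _ in segments]

def lookup(ip_long):
    # Find the last segment whose start is <= ip_long, then verify the end.
    i = bisect.bisect_right(starts, ip_long) - 1
    if i >= 0 and ip_long <= segments[i][1]:
        return segments[i][2]
    return None  # IP not covered by any segment
```

With the full database this keeps the lookup table at a few hundred thousand rows instead of 300 million, at the cost of implementing the lookup inside a custom reduce rather than a plain Pig JOIN.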
