A very common analysis need: the access log records each visitor's (domestic) IP address, and we want to resolve each IP to a geographic location, accurate to the city or county level. The data volume averages 15 GB per day, and results need to be aggregated by day, week, and month.
The final approach is to take an IP location database, which maps each IP segment to an address (e.g. 1.1.0.0 - 1.1.0.255 -> Fujian Province), and expand it into one row per individual IP (e.g. 1.1.0.1 -> Fujian Province). With that file, the actual address of each visitor IP can be resolved with a JOIN in Pig.
IP Database
The IP Geographic Information Standards Committee of the Interactive Network Branch of the China Advertising Association, established in 2012 with board members from large advertising companies, has published an IP geo-information standard library. You can also download the global city-level IP library and take the CN section:
File: cn.csv  Size: 1.26 MB
Validity: permanent
Read this file and convert the starting and ending IP of each segment to integers; their difference is the number of hosts contained in that segment, and every IP in between is generated with the range function. The main Python code (Python 2, which has reduce as a built-in) is as follows:
    # coding: utf-8
    def ip2long(ip_string):
        # Strings without a '.' (the city column) pass through unchanged
        return ip_string.find('.') != -1 and \
            reduce(lambda a, b: a << 8 | b, map(int, ip_string.split('.'))) or ip_string

    if __name__ == '__main__':
        output = open(r'out.csv', 'a')
        input = open(r'cn.csv')
        for l in input.readlines():
            s, e, city = map(ip2long, l.split(','))
            if city.strip() == 'Area':  # skip the header row
                continue
            for i in range(s, e + 1):
                # city keeps its trailing newline from the source file
                output.write('%s,%s' % (str(i), city))
        output.close()
        input.close()
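As a quick sanity check of the conversion, here is a Python 3 sketch of the same ip2long (where reduce has moved to functools), applied to the example segment mentioned above:

```python
# Python 3 version of the IP-to-integer conversion used above.
from functools import reduce

def ip2long(ip_string):
    """Convert a dotted-quad IP string to its 32-bit integer value."""
    return reduce(lambda a, b: (a << 8) | b, map(int, ip_string.split('.')))

print(ip2long('1.1.0.0'))    # 16842752
print(ip2long('1.1.0.255'))  # 16843007
# Host count in the segment 1.1.0.0 - 1.1.0.255:
print(ip2long('1.1.0.255') - ip2long('1.1.0.0') + 1)  # 256
```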
The results are written to out.csv. China has roughly 300 million IPs in total, so the generated file is 7.9 GB (around 900 MB with default gzip compression) and needs to be split to suit the map tasks.
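A minimal sketch of one way to do that splitting with standard Unix tools (the file names and chunk size here are illustrative, not from the original setup):

```shell
# Tiny stand-in for the real out.csv, just to show the mechanics:
printf '16842752,Fujian\n16842753,Fujian\n16842754,Fujian\n16842755,Fujian\n' > demo.csv
# Split into fixed-size line chunks (the real file would use millions of
# lines per chunk), then gzip each chunk; Hadoop/Pig read *.gz transparently.
split -l 2 demo.csv ip_part_
gzip ip_part_*
ls ip_part_*.gz
```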
Pig Latin
Load the log files (*.log.gz) and the IP database (ip*.tar.gz) stored in HDFS, convert the visitor IP in each log record to an integer, JOIN the two relations, then GROUP and COUNT to get the number of visitors per area:
    ip_map_data = LOAD '/user/ip*.tar.gz' USING PigStorage(',') AS (iplong:long, code:int);
    data = LOAD '/home/*.log.gz' USING PigStorage('"') AS (id:chararray, ip:chararray);
    split_data = FOREACH data {
        ip = STRSPLIT(ip, '\\.', 4);
        GENERATE id, (long)ip.$0 * 16777216 + (long)ip.$1 * 65536 + (long)ip.$2 * 256 + (long)ip.$3;
    };
    join_data = JOIN ip_map_data BY iplong, split_data BY $1;
    group_data = GROUP join_data BY $1 PARALLEL 6;
    count = FOREACH group_data GENERATE group, COUNT(join_data);
    DUMP count;
Counting one week of results took 44 minutes 33 seconds:
(Beijing, 32174022)
(Baoding, 4244694)
(Handan, 1062551)
Job Name: PigLatin:ip.pig
Job-ACLs: All users are allowed
Job Setup: Successful
Status: Succeeded
Finished in: 44mins, 33sec
Job Cleanup: Successful

Afterword
Before this, I tried writing a Pig UDF that used a {ip: city} dictionary for the IP-to-city lookup, but Java limits the size of a single method, and the UDF hurt performance, so that approach was dropped. If you instead write a reduce that supports something like SQL's JOIN ... BETWEEN on IP ranges, there is no need to generate a file of more than 300 million records at all. The above was run on a very small test Hadoop cluster; an actual production environment should process it faster.
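For illustration, that "join between" idea amounts to a range lookup: keep the (start, end, city) segments sorted by start and binary-search them instead of expanding every IP. A minimal single-machine sketch with the stdlib bisect module (the segments below are made up for the demo):

```python
import bisect

# Sorted, non-overlapping IP segments as (start, end, city);
# values are illustrative, not taken from the real cn.csv.
ranges = [
    (16842752, 16843007, 'Fujian'),   # 1.1.0.0 - 1.1.0.255
    (16843008, 16843263, 'Beijing'),  # 1.1.1.0 - 1.1.1.255
]
starts = [r[0] for r in ranges]

def lookup(ip_long):
    """Find the segment whose range contains ip_long, if any."""
    i = bisect.bisect_right(starts, ip_long) - 1
    if i >= 0 and ranges[i][0] <= ip_long <= ranges[i][1]:
        return ranges[i][2]
    return None

print(lookup(16842800))  # Fujian
```

The same lookup could live inside a custom reducer, so each log IP is matched against segment boundaries directly rather than against 300 million expanded rows.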
Hadoop (Pig) statistics on IP geolocation