Import logs and analysis such as apachenglogs into Hive

Source: Internet
Author: User
Two methods to import nginx logs to hive: 1. Create the CREATETABLEapachelog table in hive (ipaddressSTRING, identdSTRING, userSTRING

Two methods to import nginx logs to hive: 1. create table apachelog (ipaddress STRING, identd STRING, user STRING) in hive

Two methods to import nginx logs to hive

1. Create a table in hive

  • The log format after import is

    203.208.60.91--05/May/2011: 01: 18: 47 + 0800 GET/robots.txt HTTP/1.1 404 1238 Mozilla/5.0

    This method supports the parse_url (referer, "HOST") function in hive ")

    Method 2 Import

    Note: This method must be executed before using the query statement after creating a table.

    Hive> add jar/home/hjl/hive/lib/hive_contrib.jar;

    Or set hive/conf/hive-default.conf to add


    Hive. aux. jars. path
    File: // usr/local/hadoop/hive/lib/hive-contrib-0.7.0-cdh3u0.jar.

    Save Configuration

  • 203.208.60.91--[05/May/2011: 01: 18: 47 + 0800] "GET/robots.txt HTTP/1.1" 404 1238 "-" "Mozilla/5.0 (compatible; googlebot/2.1; + )"

    The field type stringfrom deserializer in this method is tested and does not support obtaining domain names through parse_url (referer, "HOST ").

    You can use select split (referer, "/") [2] from apilog to get the Domain Name

    If the file data is plain text, you can use stored as textfile. If data needs to be compressed, use stored as sequence.

    Import log command

    Hive> load data local inpath '/home/log/map.gz' overwrite into table log;

    Imported logs support formats such as .gz

    Example

    Number of Statistics rows
    Select count (*) from nginxlog;

    IP count statistics
    Select count (DISTINCT ip) from nginxlog;

    Ranking
    Select t2.ip, t2.xx from (SELECT ip, COUNT (*) AS xx FROM nginxlog GROUP by ip) t2 sort by t2.xx desc

    Hive> SELECT * from apachelog WHERE ipaddress = '192. 211.123.184 ';

    Hive> SELECT ipaddress, COUNT (1) AS numrequest FROM apachelog group by ipaddress sort by numrequest desc limit 1;

    Hive> set mapred. reduce. tasks = 2;
    Hive> SELECT ipaddress, COUNT (1) AS numrequest FROM apachelog group by ipaddress sort by numrequest desc limit 1;

    Hive> create table ipsummary (ipaddress STRING, numrequest INT );
    Hive> insert overwrite table ipsummary SELECT ipaddress, COUNT (1) FROM apachelog group by ipaddress;

    Hive> SELECT ipsummary. ipaddress, ipsummary. numrequest FROM (select max (numrequest) AS themax FROM ipsummary) ipsummarymax JOIN ipsummary ON ipsummarymax. themax = ipsummary. numrequest;

    Export hive query results to csv (not tested)

    Hive> set hive. io. output. fileformat = CSVTextFile;
    Hive> insert overwrite local directory '/tmp/CSVrepos/'select * from S where ...;

    Contact Us

    The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

    If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

    A Free Trial That Lets You Build Big!

    Start building with 50+ products and up to 12 months usage for Elastic Compute Service

    • Sales Support

      1 on 1 presale consultation

    • After-Sales Support

      24/7 Technical Support 6 Free Tickets per Quarter Faster Response

    • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.