Hdinsight-hadoop actual Combat (i) Website log analysis

Source: Internet
Author: User
Tags odbc connection hdinsight

Hdinsight-hadoop actual Combat (i) Website log analysis

Brief introduction

In this example, you will use the HDInsight query that parses the Web site log file to gain insight into how customers use the site. With this analysis, you can see a summary of the frequency of visits to the site within a day of the external site and the site error of the user experience.

In this tutorial, you will learn how to use HDInsight:

    • Connect to an Azure Storage Blob that contains a Web site log file
    • Create a hive table to query these logs
    • Create a hive query to analyze the data
    • Use Microsoft Excel to connect to HDInsight (using an ODBC connection) to retrieve the parsed data
Prerequisite

You have used the cluster to configure everything you need to complete your scripts and queries. To export the analyzed data to Microsoft Excel, you must meet the following requirements:

    • You must have Microsoft Excel , or Microsoft Excel, installed.
    • You must have a Microsoft hive ODBC driver to import data from the hive into Excel. Select a 32-bit or 64-bit version based on a Microsoft Excel version.
Web site log data that has been loaded into a Windows Azure storage Blob

The following is the location of the site log data used by this example. You can access this data from the file Browser tab at the top of this page. You can also do this in the [default storage Account]/[defaultcontainer]/hdisamples/websitelogsampledata/samplelog Path to access the data for this sample.

Site log Data

Wasb://[email protected]/hdisamples/websitelogsampledata/samplelog/

Creating a hive table to query Web site log data

The following hive statement creates an external table that allows the hive to query data stored in Azure Blob storage. The external table preserves the data in its original file format, while allowing the hive to execute queries against the data within the file.

A hive statement creates a new table called a blog by describing the fields within the file, the qualifiers between fields, and the location of the files in Azure Blob storage. In this tutorial, you create a hive query to analyze the data section, and you execute the query against the data that is stored in this table.

Createexternal Table weblogs

DROP TABLE ifexists weblogs;

--create tableweblogs on space-delimited website log data

CREATE externaltable IF not EXISTS weblogs (s_date date, s_time string, S_sitename string,cs_method string, Cs_uristem stri Ng

cs_uriquerystring, S_port int, cs_username string, c_ip string, cs_useragent string,

Cs_cookiestring, Cs_referer string, cs_host string, sc_status int, sc_substatus int,

Sc_win32statusint, sc_bytes int, cs_bytes int, S_timetaken int)

ROW formatdelimited fields TERMINATED by '

STORED astextfile location ' Wasb://[email protected]/hdisamples/websitelogsampledata/samplelog/'

Tblproperties (' skip.header.line.count ' = ' 2 ');

Create a hive query to analyze the data

The following hive query creates two new tables based on the queries that are running on the Web log table. The new table is named clienterrors and refersperday.

clienterrors queries from the to to sc_status column. The extracted data is then sorted by the number of occurrences of each error code and written to the clienterrors table.

Refersperday Query extracts data from the blog table of all external sites that reference this site. External site information is extracted from the cs_referer column of the blog table. To ensure that the reference link does not encounter errors, the table displays only the page data that returns the HTTP status code between. The extracted data is then written to the refersperday table.

DROP TABLE ifexists clienterrors;

--create tableclienterrors for storing errors users experienced and their frequencies

CREATE externaltable clienterrors (sc_status int, cs_referer string, cs_page string, cnt int)

ROW formatdelimited fields TERMINATED by ', ';

--populate tableclienterrors with data from table weblogs

INSERT overwritetable clienterrors

SELECT Sc_status,cs_referer,

Concat (Cs_uristem, '? ', Regexp_replace (Cs_uriquery, ' x-arr-log-id=[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4} -[0-9A-F]{12} ', ') cs_page,

Count (DISTINCTC_IP) as CNT

From weblogs

WHERE sc_status>=400 and Sc_status < 500

GROUP Bysc_status, Cs_referer, concat (Cs_uristem, '? ', Regexp_replace (Cs_uriquery, ' x-arr-log-id=[0-9a-f]{8}-[0-9a-f ]{4}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{12} ', ')

ORDER by CNT;

-------------------------------------------------------------------------------------------

DROP TABLE ifexists Refersperday;

--create Tablerefersperday for storing references from external websites

CREATE externaltable IF not EXISTS refersperday (year int, month int, day int, cs_refererstring, cnt int)

ROW formatdelimited fields TERMINATED by ', ';

--populate tablerefersperday with data from the Weblogs table

INSERT overwritetable Refersperday

Selectyear (S_date), month (s_date), Day (s_date), Cs_referer, COUNT (distinct c_ip) ascnt

From weblogs

WHERE sc_status>=200 and Sc_status <300

GROUP by S_date,cs_referer

ORDER by Cntdesc;

-------------------------------------------------------------------------------------------

Executing query

Click Submit to run the query shown in the previous section. The query performs the following tasks:

    1. Create a blog table from the original site log data in the Azure Blob storage associated with the HDInsight cluster.
    2. Create and populate the clienterrors and refersperday tables described in the previous section.

When you run a query, you can click View Details to get more information about running tasks in the background. After all the jobs at the bottom of the page are in the completed state, continue to load the data into Excel.

DROP TABLE ifexists weblogs;

--create tableweblogs on space-delimited website log data

CREATE externaltable IF not EXISTS weblogs (s_date date, s_time string, S_sitename string,cs_method string, Cs_uristem stri Ng

cs_uriquerystring, S_port int, cs_username string, c_ip string, cs_useragent string,

Cs_cookiestring, Cs_referer string, cs_host string, sc_status int, sc_substatus int,

Sc_win32statusint, sc_bytes int, cs_bytes int, S_timetaken int)

ROW formatdelimited fields TERMINATED by '

STORED astextfile location ' Wasb://[email protected]/hdisamples/websitelogsampledata/samplelog/'

Tblproperties (' skip.header.line.count ' = ' 2 ');

DROP TABLE ifexists clienterrors;

--create tableclienterrors for storing errors users experienced and their frequencies

CREATE externaltable clienterrors (sc_status int, cs_referer string, cs_page string, cnt int)

ROW formatdelimited fields TERMINATED by ', ';

--populatetable clienterrors with data from table weblogs

Insertoverwrite TABLE clienterrors

Selectsc_status, Cs_referer,

Concat (Cs_uristem, '? ', Regexp_replace (Cs_uriquery, ' x-arr-log-id=[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4} -[0-9A-F]{12} ', ') cs_page,

COUNT (distinct c_ip) ascnt

From weblogs

WHERE sc_status>=400 and Sc_status < 500

GROUP Bysc_status, Cs_referer, concat (Cs_uristem, '? ', Regexp_replace (Cs_uriquery, ' x-arr-log-id=[0-9a-f]{8}-[0-9a-f ]{4}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{12} ', ')

ORDER by CNT;

DROP TABLE ifexists Refersperday;

--create Tablerefersperday for storing references from external websites

CREATE externaltable IF not EXISTS refersperday (year int, month int, day int, cs_refererstring, cnt int)

ROW formatdelimited fields TERMINATED by ', ';

--populatetable refersperday with data from the Weblogs table

Insertoverwrite TABLE Refersperday

Selectyear (S_date), month (s_date), Day (s_date), Cs_referer, COUNT (distinct c_ip) ascnt

From weblogs

WHERE sc_status>=200 and Sc_status <300

GROUP Bys_date, Cs_referer

ORDER by Cntdesc;


Job session

Query name

Date

Id

Operation

State

No data available in table

Loading data into Excel

After you create a hive table, you can use the Microsoft hive ODBC Drive to import data from the hive to Excel. After you install the driver, use the following steps to connect to the table.

1. Open Excel and create a blank worksheet.

2. From the Data tab, select from a different source and select from Microsoft query .

3. When prompted to select a data source , select the sample Microsoft hive DSN.

4. In the Microsoft hive ODBC Drive Connection dialog box, enter the following values, and then click OK.

    • host -the host name of the HDInsight cluster. For example, mycluster.azurehdinsight.net
    • User name -administrator name for the HDInsight cluster
    • Password -Administrator password

All other fields are the default values.

5. In the Query Wizard, select the refersperday table and select the > button.

6. Click next to continue to view the wizard until you reach the dialog box with the Finish button. Click Finish .

7. When the Import Data dialog box appears, click OK to accept the default values. When the query is complete, the data is displayed in Excel.

Summary

In this tutorial, you learned how to use Azure HDInsight to analyze Web site log data that uses Apache Hive. You browsed through a process to see how the raw data was first uploaded to an Azure storage Blob and then loaded into the hive table to execute the query. Finally, you learned how to import the results of a hive query into Microsoft Excel. If you have feedback on this tutorial or other examples, please use the Help + feedback link above.

Use the following links to continue to learn how to use hive and Excel with HDInsight.

    • add hive and HDInsight Hadoop use
    • use HDInsight Hadoop analysis Twitter data
    • microsoft  hive ODBC Excel connect to Hadoop

Reprint Please specify the Source: http://blog.csdn.net/yangzhenping , thank you!

Hdinsight-hadoop actual Combat (i) Website log analysis

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.