Hdinsight-hadoop actual Combat (i) Website log analysis
Brief introduction
In this example, you will use the HDInsight query that parses the Web site log file to gain insight into how customers use the site. With this analysis, you can see a summary of the frequency of visits to the site within a day of the external site and the site error of the user experience.
In this tutorial, you will learn how to use HDInsight:
- Connect to an Azure Storage Blob that contains a Web site log file
- Create a hive table to query these logs
- Create a hive query to analyze the data
- Use Microsoft Excel to connect to HDInsight (using an ODBC connection) to retrieve the parsed data
Prerequisite
You have used the cluster to configure everything you need to complete your scripts and queries. To export the analyzed data to Microsoft Excel, you must meet the following requirements:
- You must have Microsoft Excel , or Microsoft Excel, installed.
- You must have a Microsoft hive ODBC driver to import data from the hive into Excel. Select a 32-bit or 64-bit version based on a Microsoft Excel version.
Web site log data that has been loaded into a Windows Azure storage Blob
The following is the location of the site log data used by this example. You can access this data from the file Browser tab at the top of this page. You can also do this in the [default storage Account]/[defaultcontainer]/hdisamples/websitelogsampledata/samplelog Path to access the data for this sample.
Site log Data
Wasb://[email protected]/hdisamples/websitelogsampledata/samplelog/
Creating a hive table to query Web site log data
The following hive statement creates an external table that allows the hive to query data stored in Azure Blob storage. The external table preserves the data in its original file format, while allowing the hive to execute queries against the data within the file.
A hive statement creates a new table called a blog by describing the fields within the file, the qualifiers between fields, and the location of the files in Azure Blob storage. In this tutorial, you create a hive query to analyze the data section, and you execute the query against the data that is stored in this table.
Createexternal Table weblogs
DROP TABLE ifexists weblogs;
--create tableweblogs on space-delimited website log data
CREATE externaltable IF not EXISTS weblogs (s_date date, s_time string, S_sitename string,cs_method string, Cs_uristem stri Ng
cs_uriquerystring, S_port int, cs_username string, c_ip string, cs_useragent string,
Cs_cookiestring, Cs_referer string, cs_host string, sc_status int, sc_substatus int,
Sc_win32statusint, sc_bytes int, cs_bytes int, S_timetaken int)
ROW formatdelimited fields TERMINATED by '
STORED astextfile location ' Wasb://[email protected]/hdisamples/websitelogsampledata/samplelog/'
Tblproperties (' skip.header.line.count ' = ' 2 ');
Create a hive query to analyze the data
The following hive query creates two new tables based on the queries that are running on the Web log table. The new table is named clienterrors and refersperday.
clienterrors queries from the to to sc_status column. The extracted data is then sorted by the number of occurrences of each error code and written to the clienterrors table.
Refersperday Query extracts data from the blog table of all external sites that reference this site. External site information is extracted from the cs_referer column of the blog table. To ensure that the reference link does not encounter errors, the table displays only the page data that returns the HTTP status code between. The extracted data is then written to the refersperday table.
DROP TABLE ifexists clienterrors;
--create tableclienterrors for storing errors users experienced and their frequencies
CREATE externaltable clienterrors (sc_status int, cs_referer string, cs_page string, cnt int)
ROW formatdelimited fields TERMINATED by ', ';
--populate tableclienterrors with data from table weblogs
INSERT overwritetable clienterrors
SELECT Sc_status,cs_referer,
Concat (Cs_uristem, '? ', Regexp_replace (Cs_uriquery, ' x-arr-log-id=[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4} -[0-9A-F]{12} ', ') cs_page,
Count (DISTINCTC_IP) as CNT
From weblogs
WHERE sc_status>=400 and Sc_status < 500
GROUP Bysc_status, Cs_referer, concat (Cs_uristem, '? ', Regexp_replace (Cs_uriquery, ' x-arr-log-id=[0-9a-f]{8}-[0-9a-f ]{4}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{12} ', ')
ORDER by CNT;
-------------------------------------------------------------------------------------------
DROP TABLE ifexists Refersperday;
--create Tablerefersperday for storing references from external websites
CREATE externaltable IF not EXISTS refersperday (year int, month int, day int, cs_refererstring, cnt int)
ROW formatdelimited fields TERMINATED by ', ';
--populate tablerefersperday with data from the Weblogs table
INSERT overwritetable Refersperday
Selectyear (S_date), month (s_date), Day (s_date), Cs_referer, COUNT (distinct c_ip) ascnt
From weblogs
WHERE sc_status>=200 and Sc_status <300
GROUP by S_date,cs_referer
ORDER by Cntdesc;
-------------------------------------------------------------------------------------------
Executing query
Click Submit to run the query shown in the previous section. The query performs the following tasks:
- Create a blog table from the original site log data in the Azure Blob storage associated with the HDInsight cluster.
- Create and populate the clienterrors and refersperday tables described in the previous section.
When you run a query, you can click View Details to get more information about running tasks in the background. After all the jobs at the bottom of the page are in the completed state, continue to load the data into Excel.
DROP TABLE ifexists weblogs;
--create tableweblogs on space-delimited website log data
CREATE externaltable IF not EXISTS weblogs (s_date date, s_time string, S_sitename string,cs_method string, Cs_uristem stri Ng
cs_uriquerystring, S_port int, cs_username string, c_ip string, cs_useragent string,
Cs_cookiestring, Cs_referer string, cs_host string, sc_status int, sc_substatus int,
Sc_win32statusint, sc_bytes int, cs_bytes int, S_timetaken int)
ROW formatdelimited fields TERMINATED by '
STORED astextfile location ' Wasb://[email protected]/hdisamples/websitelogsampledata/samplelog/'
Tblproperties (' skip.header.line.count ' = ' 2 ');
DROP TABLE ifexists clienterrors;
--create tableclienterrors for storing errors users experienced and their frequencies
CREATE externaltable clienterrors (sc_status int, cs_referer string, cs_page string, cnt int)
ROW formatdelimited fields TERMINATED by ', ';
--populatetable clienterrors with data from table weblogs
Insertoverwrite TABLE clienterrors
Selectsc_status, Cs_referer,
Concat (Cs_uristem, '? ', Regexp_replace (Cs_uriquery, ' x-arr-log-id=[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4} -[0-9A-F]{12} ', ') cs_page,
COUNT (distinct c_ip) ascnt
From weblogs
WHERE sc_status>=400 and Sc_status < 500
GROUP Bysc_status, Cs_referer, concat (Cs_uristem, '? ', Regexp_replace (Cs_uriquery, ' x-arr-log-id=[0-9a-f]{8}-[0-9a-f ]{4}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{12} ', ')
ORDER by CNT;
DROP TABLE ifexists Refersperday;
--create Tablerefersperday for storing references from external websites
CREATE externaltable IF not EXISTS refersperday (year int, month int, day int, cs_refererstring, cnt int)
ROW formatdelimited fields TERMINATED by ', ';
--populatetable refersperday with data from the Weblogs table
Insertoverwrite TABLE Refersperday
Selectyear (S_date), month (s_date), Day (s_date), Cs_referer, COUNT (distinct c_ip) ascnt
From weblogs
WHERE sc_status>=200 and Sc_status <300
GROUP Bys_date, Cs_referer
ORDER by Cntdesc;
Job session
Query name |
Date |
Id |
Operation |
State |
No data available in table |
Loading data into Excel
After you create a hive table, you can use the Microsoft hive ODBC Drive to import data from the hive to Excel. After you install the driver, use the following steps to connect to the table.
1. Open Excel and create a blank worksheet.
2. From the Data tab, select from a different source and select from Microsoft query .
3. When prompted to select a data source , select the sample Microsoft hive DSN.
4. In the Microsoft hive ODBC Drive Connection dialog box, enter the following values, and then click OK.
- host -the host name of the HDInsight cluster. For example, mycluster.azurehdinsight.net
- User name -administrator name for the HDInsight cluster
- Password -Administrator password
All other fields are the default values.
5. In the Query Wizard, select the refersperday table and select the > button.
6. Click next to continue to view the wizard until you reach the dialog box with the Finish button. Click Finish .
7. When the Import Data dialog box appears, click OK to accept the default values. When the query is complete, the data is displayed in Excel.
Summary
In this tutorial, you learned how to use Azure HDInsight to analyze Web site log data that uses Apache Hive. You browsed through a process to see how the raw data was first uploaded to an Azure storage Blob and then loaded into the hive table to execute the query. Finally, you learned how to import the results of a hive query into Microsoft Excel. If you have feedback on this tutorial or other examples, please use the Help + feedback link above.
Use the following links to continue to learn how to use hive and Excel with HDInsight.
- add hive and HDInsight Hadoop use
- use HDInsight Hadoop analysis Twitter data
- microsoft hive ODBC Excel connect to Hadoop
Reprint Please specify the Source: http://blog.csdn.net/yangzhenping , thank you!
Hdinsight-hadoop actual Combat (i) Website log analysis