Analyzing the big data processing features of Microsoft HadoopOnAzure


In big data technology, Apache Hadoop and MapReduce attract the most attention. But managing a Hadoop Distributed File System (HDFS) or writing MapReduce jobs in Java is not easy. This is where Apache Hive can help.

The Hive data warehouse tool is also an Apache Foundation project and one of the key components of the Hadoop ecosystem. It provides a SQL-like query language, HiveQL, whose statements are automatically translated into MapReduce jobs.
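For example, a familiar-looking aggregate such as the following minimal sketch (written against a hypothetical page_views table, not part of this article's walkthrough) is compiled by Hive into one or more MapReduce jobs behind the scenes:

SELECT page, COUNT(*) AS views
FROM page_views
GROUP BY page;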

In the BI realm, relational databases such as IBM DB2, Oracle, and SQL Server have long held the dominant position. This has made SQL the preferred language for business intelligence, and most data analysis experts have well-rounded SQL skills. Likewise, data analysts are familiar with tools such as Excel, PivotTables, and charts.

Let's first look at how an end-to-end BI project works on Windows Azure. The example turns a large dataset on the punctuality of U.S. air carriers' flights into an Excel chart, and the entire process requires no program code to be written.

Apache Hadoop on Windows Azure CTP

Late last year, Microsoft's SQL Server team announced an Apache Hadoop service on the Windows Azure platform, known as HadoopOnAzure. Microsoft said it would simplify the setup and use of Hadoop, allowing users to extract unstructured data from a Hadoop cluster with Hive and analyze it in Excel, while taking advantage of the elasticity of Windows Azure.

The HadoopOnAzure community preview is currently open by invitation: fill out a short questionnaire on Microsoft Connect to obtain an invitation code, then log in with a Windows Live ID. Enter a unique DNS username, select the initial Hadoop cluster size, provide a cluster login name and password, and click the Request Cluster button. (See Figure 1)

Figure 1: Cluster settings can be configured in a few simple steps.

It takes 15 to 30 minutes to provision and set up the cluster. HadoopOnAzure Community Preview resources are free, but a cluster expires 24 hours after it is created, and the lease must be renewed during the final 6 hours of that period. With continued use, the lease needs to be renewed daily.

Users who want to persist data in Windows Azure blob storage need a Windows Azure subscription and a storage account; otherwise, all data stored in HDFS is lost when the cluster times out. Users without a subscription can apply for a free three-month Windows Azure account, which gives each user 20 GB of storage, a million storage transactions, and 20 GB of outbound bandwidth.

Populating a Windows Azure blob with big data

The example extracts data from the U.S. Federal Aviation Administration (FAA): on-time arrival and delay information for flights during the six months from August 2011 through January 2012. The six tab-delimited text files contain a subset of the columns in the FAA source files, with about 500,000 rows of data per file.

Users need to upload the data to a folder in a blob container so that Hive can find it. For detailed steps on creating the Azure blob source data, refer to my previous article, which also describes the data files, how to download them with a Windows Live SkyDrive account, and finally how to upload them to a Windows Azure blob.

After the cluster configuration is complete, logging in to the Elastic MapReduce portal brings up the cluster's account management page. (See Figure 2)

Figure 2: The HadoopOnAzure Elastic MapReduce dashboard page.

Copy the primary access key from the Windows Azure Management Portal to the clipboard, click Manage Cluster to open the management page, and then click Set Up ASV (Azure Storage Vault) to use the Windows Azure storage account as the data source for Hive tables. Alternatively, data from Amazon S3 or the Windows Azure Marketplace DataMarket can be used as the data source for a Hive table.

Enter your storage account name, paste the primary access key into the Passkey box, and click Save Settings; Hive can then access the blob storage. If the credentials are validated, a notification confirms that the Azure account was set up successfully.

Unlike raw HDFS files, Hive tables require a schema, even for the simplest key-value (KV) data. To create a Hive table over HDFS or external tab-delimited data, naming each column and defining its data type, run a CREATE EXTERNAL TABLE statement, as shown in the following HiveQL:

CREATE EXTERNAL TABLE flightdata_asv (
  year INT,
  month INT,
  day INT,
  carrier STRING,
  origin STRING,
  dest STRING,
  depdelay INT,
  arrdelay INT
)
COMMENT 'FAA on-time data'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '9'
STORED AS TEXTFILE
LOCATION 'asv://aircarrier/flightdata';

Apache Hive has relatively few data types and does not support date or time fields, but the integer year, month, and day columns in the source data serve well for this purpose. The departure delay (depdelay) and arrival delay (arrdelay) values are expressed in minutes. (In the statement above, the field delimiter '9' is the ASCII code for the tab character.)
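Because there is no native date type, a query can assemble the integer columns into a sortable date string on the fly. The following is only a sketch using standard Hive string built-ins, with the column names taken from the table defined above:

SELECT CONCAT(CAST(year AS STRING), '-',
              LPAD(CAST(month AS STRING), 2, '0'), '-',
              LPAD(CAST(day AS STRING), 2, '0')) AS flight_date,
       carrier, arrdelay
FROM flightdata_asv
LIMIT 10;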

To execute HiveQL statements interactively, click the Elastic MapReduce Interactive Console tile, then click the Hive button to open the Interactive Hive page. A read-only text box occupies the top of the page; type your statements into the text box below it. (See Figure 3)

Figure 3: The Interactive Hive page's Tables list shows the available table names, and the Columns list shows the selected table's column names. Click the >> button to insert the selected entry into the query text box.
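A quick sanity check in the query text box confirms that the table is wired up to the blob data correctly, for example:

SELECT * FROM flightdata_asv LIMIT 10;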

Download and install the Apache Hive ODBC driver and Excel plugin

Return to the Elastic MapReduce main page and click the Downloads tile. Locate the installer link that corresponds to your Excel version, then click Run to open the security warning dialog. Click More Options, choose the Run Anyway option to start the installation, and the ODBC driver's Hive Setup dialog opens. Tick the I accept check box.

Click Install to install the driver, then click Finish to exit the installer. Next, open Excel, click the Data tab, and confirm that the Hive Pane icon is present; clicking it makes the Hive query pane appear to the right of the worksheet. The installer places the Hive Pane icon in the Hive Data section of the ribbon.

Return to the EMR portal home page, click Open Ports to reach the Configure Ports page, and drag the ODBC Server toggle to the right to open TCP port 10000.

Perform an interactive Apache Hive query

Return to Excel, click the Hive Pane icon to open the Hive Query task pane, and click Enter Cluster Details to open the ODBC Hive Setup dialog. Enter a description and the DNS host name, and accept the default TCP port. Next, select Username/Password authentication and enter your Elastic MapReduce portal username and password. (See Figure 4)

Figure 4: The ODBC Hive Setup dialog, with the connection description, host name, TCP port, and cluster username and password filled in.

If the options set in the ODBC Hive Setup dialog are correct, the description you entered appears when you open the Select or Enter Hive Connection list. Open the Select the Hive Object to Query list and select flightdata_asv to populate the Columns list.

To perform an aggregate query that shows the average arrival delay per carrier, tick the carrier and arrdelay columns, open the list of functions for the arrdelay field, and double-click AVG in the list to add it to the HiveQL statement (see Figure 5).

Figure 5: Double-click AVG in the function list to add it to the aggregate HiveQL query. HiveQL's function library is richer than that of most SQL dialects.

Next, clear the Limit Results check box, open the aggregate grouping list, and select the carrier column.

Type arrdelay between the parentheses of AVG() so the expression reads AVG(arrdelay), which works around a quirk of the query designer, and click Execute Query to get the results. (See Figure 6)

Figure 6: The results of executing the HiveQL query. B6 and F9 are FAA two-character carrier codes: B6 is JetBlue and F9 is Frontier Airlines.
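The statement the designer builds is roughly equivalent to the following HiveQL (a reconstruction; the exact text shown in the read-only box may differ slightly):

SELECT carrier, AVG(arrdelay)
FROM flightdata_asv
GROUP BY carrier;
-- a LIMIT n clause is appended when the Limit Results box is checked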

Delete the spurious carrier entry from the results; it appears because the header row of each source file was left in the data and so shows up in the query output. Round the averages to one decimal place, close the task pane, then chart the data in the worksheet, adding a chart title, axis titles, and data labels. (See Figure 7.) A query-time filter, shown below, can avoid the manual deletion.

Figure 7: The Excel chart generated from the data in Figure 6.
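Rather than deleting the bad row by hand, the header rows can be filtered out at query time. For example, assuming the stray value is the literal column header text 'Carrier':

SELECT carrier, AVG(arrdelay)
FROM flightdata_asv
WHERE carrier <> 'Carrier'
GROUP BY carrier;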

This article has used an example to illustrate the simple workflow of the HadoopOnAzure CTP. Microsoft's codename "Cloud Numerics" offers similar functionality, but it requires working within Visual Studio 2010. HadoopOnAzure can send tabular data directly to Excel for further analysis. In addition, the interactive Hive console, the Hive ODBC data source, and the corresponding Excel plug-in make HadoopOnAzure an ideal platform for big data processing.

Original link: http://www.searchbi.com.cn/showcontent_62711.htm
