Introduction to HDInsight and Azure Storage services

We reviewed the updated Windows Azure HDInsight service in our previous blog post (http://www.aliyun.com/zixun/aggregation/39815.html). Today's article, the third in a five-part series introducing HDInsight, focuses on HDInsight and Azure Storage.

A distinctive aspect of the Windows Azure HDInsight service is that you can choose where to store your data. You can store it in the native HDFS file system that is local to the compute nodes, or you can use an Azure Blob storage container as the HDFS file system. In fact, when you provision an HDInsight cluster, it creates an Azure Blob storage container in your storage account and uses it as the default HDFS file system.

Alternatively, you can customize the cluster creation options to select an existing Azure Blob storage container as the default HDFS file system. For example, in this screenshot you can see how to designate the Blob storage container named "Netflix" as the default file system.

The container may have previously been configured as an HDInsight HDFS file system, or it may be an arbitrary Azure Blob storage container that happens to contain the data you want to analyze.

In our case, the Netflix container contains three blobs that use the folder naming scheme:
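
Blob storage has a flat namespace, so a "folder" is only a shared name prefix that ends in a slash. As a minimal sketch of that naming scheme, the Python snippet below groups flat blob names by their virtual folder. The blob names are hypothetical placeholders chosen for illustration; they are not the actual contents of the Netflix container, and `group_by_folder` is a helper invented for this example.

```python
from collections import defaultdict


def group_by_folder(blob_names):
    """Group flat blob names by virtual folder (the text before the last '/')."""
    folders = defaultdict(list)
    for name in blob_names:
        # rpartition splits on the LAST '/', so nested prefixes stay intact.
        folder, _, leaf = name.rpartition("/")
        folders[folder].append(leaf)  # folder == "" means the container root
    return dict(folders)


# Hypothetical blob names, for illustration only.
blobs = ["input/part-00000", "input/part-00001", "output/result.txt"]
for folder, leaves in group_by_folder(blobs).items():
    print(folder, leaves)
```

A tool that understands this convention can present the two `input/` blobs as files inside one directory, even though the container itself stores only three flat names.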

Benefits of using an Azure Storage container

Storing data in a container that is not local to the compute nodes may seem to run counter to the Hadoop paradigm of co-locating compute and storage, but there are several benefits to storing data in an Azure Blob storage container:
- Data reuse and sharing: Data stored inside the compute nodes is locked behind the HDFS API. This means the data is available only to applications that are HDFS-aware and have access to the compute cluster. Data in an Azure Storage container can be accessed either through the HDFS API or through the Azure Blob storage REST API. As a result, a larger set of applications and tools can produce and consume the data, and one application can produce data while others consume it.
- Data archiving: Data inside the compute nodes lives only as long as the HDInsight cluster that you provisioned. You must either keep the cluster alive beyond your compute time or reload the data into the cluster each time you provision one to run your computations. In an Azure Storage container, you can keep data for any length of time.
- Data storage cost: Storing data long term in an active HDInsight cluster costs more than storing it in an Azure Storage container, because a compute cluster costs more than an Azure Blob storage container. In addition, you save data-loading costs because you do not have to reload the data every time you create a compute cluster.
- Elastic scaling: Although HDFS gives the HDInsight cluster a scaled-out file system, its capacity is determined by the number of nodes that you provision for the cluster. Changing that capacity can be a complex process, and it is much simpler to rely on the elastic scaling that an Azure Storage container gets automatically from Azure Blob storage.
- Geo-replication: Azure Blob storage containers can be geo-replicated through the Azure portal. While this provides geographic recovery and data redundancy, failing over to the geo-replicated location can significantly affect performance and may incur additional costs. We therefore recommend choosing geo-replication wisely, and only when the data is worth the extra cost.
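
To illustrate the data reuse point in the list above, the following Python sketch builds the two addresses under which the same blob can be reached: an HDFS-style URI for Hadoop tools and an HTTPS URL for the Blob storage REST API. It assumes the `<scheme>://<container>@<account>.blob.core.windows.net/<path>` URI form, with `wasb` as the default scheme (earlier HDInsight releases used `asv`). The account name and blob path are hypothetical, and `blob_addresses` is a helper invented for this example, not part of any Azure SDK.

```python
from urllib.parse import quote


def blob_addresses(account, container, blob_path, scheme="wasb"):
    """Return (hdfs_uri, rest_url) for one blob in an Azure Storage container."""
    path = blob_path.lstrip("/")
    host = f"{account}.blob.core.windows.net"
    # HDFS-style address used by Hadoop tools on the cluster (assumed form).
    hdfs_uri = f"{scheme}://{container}@{host}/{path}"
    # REST address usable by any HTTP client that is authorized for the container.
    rest_url = f"https://{host}/{container}/{quote(path)}"
    return hdfs_uri, rest_url


# Hypothetical account name and blob path, for illustration only.
hdfs_uri, rest_url = blob_addresses("myaccount", "netflix", "input/part-00000")
print(hdfs_uri)
print(rest_url)
```

Both strings point at the same stored bytes, which is why tools that know nothing about HDFS can still read or write the data.
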
In addition, the performance cost implied by not co-locating compute and storage is mitigated by the way compute clusters are provisioned close to the storage account resources inside the Azure data center, where the high-speed network lets compute nodes access data in ASV very efficiently. Depending on overall load and on compute and access patterns, we have observed only slightly lower performance, and often even faster access.
Also note that you save data loading time and data movement costs by not having to reload the data into the file system every time you provision an HDInsight cluster.
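
The cost argument can be made concrete with a small back-of-the-envelope calculation. The Python sketch below compares keeping a cluster running all month just to hold the data with keeping the data in a storage container and provisioning a cluster only for the hours a job needs. Every number in it is a hypothetical placeholder, not an actual Azure price, and the cost model (node-hours plus per-GB storage) is a simplifying assumption.

```python
def monthly_cost_always_on(nodes, node_hourly_rate, hours_in_month=730):
    """Cost of keeping the cluster alive all month so its local HDFS data survives."""
    return nodes * node_hourly_rate * hours_in_month


def monthly_cost_on_demand(nodes, node_hourly_rate, job_hours,
                           data_gb, storage_rate_per_gb_month):
    """Cost of storing data in a container and running the cluster only for jobs."""
    compute = nodes * node_hourly_rate * job_hours
    storage = data_gb * storage_rate_per_gb_month
    return compute + storage


# Hypothetical inputs: 4 nodes, 0.50 per node-hour, 20 job hours per month,
# 1000 GB of data, 0.05 per GB-month of storage.
always_on = monthly_cost_always_on(4, 0.50)
on_demand = monthly_cost_on_demand(4, 0.50, 20, 1000, 0.05)
print(f"always-on cluster: {always_on:.2f}")
print(f"on-demand cluster plus container: {on_demand:.2f}")
```

Under these assumed numbers the always-on option is dominated by idle node-hours. The gap narrows as the cluster's job hours approach the full month.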
