HDInsight-1, Introduction

Source: Internet
Author: User
Tags: sqoop, hdinsight

I recently needed to look into HDInsight for work, so I am taking notes here. The official documentation is naturally the most authoritative source, so the content below is taken from it: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-introduction/

Hadoop on HDInsight

Anyone who works with big data knows Hadoop, so how are HDInsight and Hadoop related? HDInsight is a Microsoft Azure-based software architecture, mainly for data analysis and management, and it uses the Hortonworks Data Platform (HDP) Hadoop distribution. One point worth noting: when we talk about Hadoop, we generally mean the whole Hadoop ecosystem, including Storm and HBase, not just the little elephant itself.

HDInsight can be understood as an Apache Hadoop implementation on Microsoft Azure that includes Storm, HBase, Pig, Hive, Sqoop, Oozie, and Ambari, and, of course, integrates with Microsoft's own Excel, SSAS, and SSRS.

HDInsight supports two operating systems, Linux and Microsoft's own Windows. The main differences are:

CATEGORY      | HADOOP ON LINUX                                      | HADOOP ON WINDOWS
Cluster OS    | Ubuntu 12.04 Long Term Support (LTS)                 | Windows Server 2012 R2
Cluster Type  | Hadoop                                               | Hadoop, HBase, Storm
Deployment    | Azure Management Portal, Azure CLI, Azure PowerShell | Azure Management Portal, Azure CLI, Azure PowerShell, HDInsight .NET SDK
Cluster UI    | Ambari                                               | Cluster Dashboard
Remote Access | Secure Shell (SSH)                                   | Remote Desktop Protocol (RDP)

Some basic concepts and definitions

    • Hadoop (the "Query" workload): Provides reliable data storage with HDFS, and a simple MapReduce programming model to process and analyze data in parallel.

    • HBase (the "NoSQL" workload): A NoSQL database built on Hadoop that provides random access and strong consistency for large amounts of unstructured and semi-structured data, potentially billions of rows times millions of columns. See Overview of HBase on HDInsight.

    • Apache Storm (the "Stream" workload): A distributed, real-time computation system for processing large streams of data fast. Storm is offered as a managed cluster in HDInsight. See Analyze real-time sensor data using Storm and Hadoop.

    • Ambari: Cluster provisioning, management, and monitoring.

    • Avro (Microsoft .NET Library for Avro): Data serialization for the Microsoft .NET environment.

    • Hive & HCatalog: Structured Query Language (SQL)-like querying, and a table and storage management layer.

    • Mahout: Machine learning.

    • MapReduce and YARN: Distributed processing and resource management.

    • Oozie: Workflow management.

    • Phoenix: Relational database layer over HBase.

    • Pig: Simpler scripting for MapReduce transformations.

    • Sqoop: Data import and export.

    • Tez: Allows data-intensive processes to run efficiently at scale.

    • ZooKeeper: Coordination of processes in distributed systems.
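
To make the "simple MapReduce programming model" from the list above a bit more concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop code and uses no Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration. It only shows the three conceptual steps: map each record to key/value pairs, group the pairs by key, then reduce each group.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map -> shuffle -> reduce over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["Hadoop stores data in HDFS",
                      "hadoop processes data in parallel"]))
```

In a real cluster the map and reduce calls run on many nodes at once and the shuffle moves data across the network; here everything runs in one process, so only the shape of the model carries over.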

HBase

There are two flavors. One is Apache HBase: an open-source NoSQL database built on Hadoop and modeled on Google BigTable, with good support for access to massive amounts of structured and semi-structured data. The other is HDInsight HBase, Microsoft's own offering, where the data is stored directly in Azure Blob storage.

HBase data can be managed with the create/get/put/scan commands of the HBase shell, where scan is the command that reads multiple rows. There is also a REST-based C# API that can be called.
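
As a note to myself on what create/get/put/scan mean, here is a toy in-memory model in Python. It is only a sketch under my own assumptions: the ToyTable class and its methods are hypothetical and are not the HBase shell, the HBase client, or the REST API. It mimics the ideas that a table is created with fixed column families, that rows are kept sorted by row key, and that a scan walks a key range from the start row (inclusive) to the stop row (exclusive).

```python
import bisect


class ToyTable:
    """In-memory stand-in for an HBase table; rows stay sorted by row key."""

    def __init__(self, name, column_families):
        # Roughly what `create` does: column families are fixed up front.
        self.name = name
        self.column_families = set(column_families)
        self._keys = []   # sorted list of row keys
        self._rows = {}   # row key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        """Write one cell; the column is written as 'family:qualifier'."""
        family = column.split(":", 1)[0]
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Read all cells of a single row (empty dict if the row is absent)."""
        return dict(self._rows.get(row_key, {}))

    def scan(self, start_row="", stop_row=None):
        """Yield (row_key, cells) in key order over [start_row, stop_row)."""
        i = bisect.bisect_left(self._keys, start_row)
        while i < len(self._keys):
            key = self._keys[i]
            if stop_row is not None and key >= stop_row:
                break
            yield key, dict(self._rows[key])
            i += 1


if __name__ == "__main__":
    table = ToyTable("notes", ["cf"])
    table.put("row2", "cf:title", "second")
    table.put("row1", "cf:title", "first")
    print(table.get("row1"))
    print(list(table.scan("row1", "row3")))
```

The point of keeping the keys sorted is that get is a single-row lookup while scan is a cheap range read, which is why row key design matters so much in HBase.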

Usage Scenarios for HBase

BigTable was originally built by Google for its own web search: you search for "three-body", and every page that mentions it is returned to you. Other usage scenarios include:

    • Key-value storage: suitable for message management, as at Facebook.
    • Sensor data: including but not limited to social data, time-series data, and audit logs.
    • Real-time query: for example Phoenix, a SQL query engine for Apache HBase.

Storm

The official site describes Storm as a distributed, fault-tolerant, open-source computation system that can process data in real time with Hadoop.

Storm on HDInsight has the following characteristics:

    • The SLA commitment is 99.9% uptime
    • Storm components can be written in Java, C#, or Python
    • Built-in mechanisms for scaling up and scaling down
    • Can be integrated with Event Hubs, Virtual Network, SQL Database, Blob storage, and DocumentDB
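
To get a feel for what "processing a stream in real time" looks like, here is a small Python sketch that borrows Storm's vocabulary of spouts (stream sources) and bolts (processing steps). It is not the Storm API: the function names and the sensor readings are invented for illustration, and Python generators stand in for a distributed topology. Each tuple flows through the whole chain as soon as it is emitted, instead of waiting for a batch to finish.

```python
from collections import defaultdict


def sensor_spout(readings):
    """Spout: the source of the stream; emits one tuple at a time."""
    for device_id, temperature in readings:
        yield device_id, temperature


def running_average_bolt(stream):
    """Bolt: consumes tuples and emits the updated per-device average."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for device_id, temperature in stream:
        totals[device_id] += temperature
        counts[device_id] += 1
        yield device_id, totals[device_id] / counts[device_id]


def alert_bolt(stream, threshold):
    """Bolt: passes on only the averages that exceed the threshold."""
    for device_id, average in stream:
        if average > threshold:
            yield device_id, average


def run_topology(readings, threshold):
    """Wire spout -> average bolt -> alert bolt and collect the alerts."""
    stream = sensor_spout(readings)
    return list(alert_bolt(running_average_bolt(stream), threshold))


if __name__ == "__main__":
    sample = [("a", 20.0), ("b", 40.0), ("a", 30.0), ("b", 50.0)]
    print(run_topology(sample, threshold=30.0))
```

In real Storm the spouts and bolts run as parallel tasks on different machines and the framework handles replaying failed tuples; none of that is modeled here.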

Scenarios for real-time processing

    • Internet of Things (IoT)
    • Fraud detection
    • Social Analytics
    • Extract, Transform, Load (ETL)
    • Network Monitoring
    • Search
    • Mobile Engagement

Spark

Apache Spark is an open-source, parallel processing framework that supports in-memory big data analytics.

Applicable scenarios:

    • Interactive data analysis and BI processing
    • Iterative machine learning (what is this?)
    • Streaming and real-time data processing
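
On the "iterative machine learning" item: as far as I understand it, this means algorithms that pass over the same dataset many times, improving a model a little on each pass, which is why keeping the data in memory helps. The sketch below is plain Python, not Spark code; the fit_line function, its parameters, and the sample points are my own illustration. It fits a straight line by batch gradient descent, and every iteration re-reads the full dataset.

```python
def fit_line(points, learning_rate=0.05, iterations=2000):
    """Fit y = w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(points)
    for _ in range(iterations):
        # Each iteration makes a full pass over the same dataset.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in points) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in points) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


if __name__ == "__main__":
    # Points that lie on y = 2x + 1.
    data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
    w, b = fit_line(data)
    print(f"w = {w:.4f}, b = {b:.4f}")
```

With a disk-based MapReduce job, each of those passes would mean reading the input from storage again; with the data cached in memory the repeated passes are cheap, which is the case Spark is aimed at.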
