I recently needed to look into HDInsight for work, so I am taking notes here. The official documentation is naturally the most authoritative source, so the content below is taken from it: https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-introduction/
Hadoop on HDInsight
Anyone working with big data knows Hadoop, so what is the relationship between HDInsight and Hadoop? HDInsight is a Microsoft Azure-based software architecture, mainly for data analysis and management, and it uses the Hortonworks Data Platform (HDP) Hadoop distribution. One point worth noting: when we say Hadoop here, we generally mean the whole Hadoop ecosystem, including Storm and HBase, not just the little elephant.
HDInsight can be understood as an Apache Hadoop implementation on Microsoft Azure that includes Storm, HBase, Pig, Hive, Sqoop, Oozie, and Ambari, and of course it integrates with Microsoft's own Excel, SSAS, and SSRS.
HDInsight supports two operating systems, Linux and Microsoft's own Windows. The main differences are:
CATEGORY | HADOOP ON LINUX | HADOOP ON WINDOWS
Cluster OS | Ubuntu 12.04 Long Term Support (LTS) | Windows Server 2012 R2
Cluster Type | Hadoop | Hadoop, HBase, Storm
Deployment | Azure Management Portal, Azure CLI, Azure PowerShell | Azure Management Portal, Azure CLI, Azure PowerShell, HDInsight .NET SDK
Cluster UI | Ambari | Cluster Dashboard
Remote Access | Secure Shell (SSH) | Remote Desktop Protocol (RDP)
Some basic concepts and definitions
- Hadoop (the "Query" workload): provides reliable data storage with HDFS, and a simple MapReduce programming model to process and analyze data in parallel.
- HBase (the "NoSQL" workload): a NoSQL database built on Hadoop that provides random access and strong consistency for large amounts of unstructured and semi-structured data, potentially billions of rows times millions of columns. See "Overview of HBase on HDInsight".
- Apache Storm (the "Stream" workload): a distributed, real-time computation system for processing large streams of data fast. Storm is offered as a managed cluster in HDInsight. See "Analyze real-time sensor data using Storm and Hadoop".
Ambari: cluster provisioning, management, and monitoring.
Avro (Microsoft .NET Library for Avro): data serialization for the Microsoft .NET environment.
Hive & HCatalog: Structured Query Language (SQL)-like querying, and a table and storage management layer.
Mahout: machine learning.
MapReduce and YARN: distributed processing and resource management.
Oozie: workflow management.
Phoenix: relational database layer over HBase.
Pig: simpler scripting for MapReduce transformations.
Sqoop: data import and export.
Tez: allows data-intensive processes to run efficiently at scale.
ZooKeeper: coordination of processes in distributed systems.
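To make the MapReduce programming model mentioned above a little more concrete, here is a minimal sketch in plain Python. It is not HDInsight or Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and everything runs in a single process, whereas a real cluster would spread the map and reduce steps across many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence (hypothetical single-process driver)."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    sample = [
        "Hadoop stores data in HDFS",
        "MapReduce processes data in parallel",
    ]
    print(word_count(sample))
```

The point of the split is that each map call only looks at one line and each reduce call only looks at one key, which is what lets the framework run them in parallel on different machines.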