Location of Hadoop

Source: Internet
Author: User

Reprinted from: http://www.weixuehao.com/archives/559

There are many great tools in the Hadoop framework that help us solve problems in our work.

[Figure: Location of Hadoop]

As the figure shows, the further right a tool sits, the more real-time its processing; the higher up it sits, the more algorithms and similar techniques it involves.

The further up and to the right a tool sits, the hotter it currently is ...

A brief introduction to the Hadoop framework

HDFS

HDFS (Hadoop Distributed File System) is Hadoop's distributed file system. It was developed by an expert engineer after Google published its paper on its own distributed file system (GFS). HDFS is built on top of a cluster and is suited to storing petabytes of data; it is highly scalable and highly fault tolerant. It is also the foundation of a Hadoop cluster, and most of the cluster's data lives on HDFS.
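To make the fault-tolerance claim concrete, here is a minimal sketch of the general idea: a file is cut into fixed-size blocks and each block is copied to several nodes, so losing one node does not lose data. This is a toy model, not HDFS code. The function names, the tiny block size, and the round-robin placement are all assumptions made for illustration; real HDFS uses much larger blocks and a rack-aware placement policy.

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut the file contents into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list, replication: int) -> dict:
    """Assign each block to `replication` distinct nodes (simple round-robin)."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_after_failure(placement: dict, failed_node: str) -> bool:
    """A file stays readable if every block still has a replica on a live node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


# Hypothetical 10-byte "file", 4-byte blocks, 4 nodes, 3 replicas per block.
blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"], replication=3)
still_readable = readable_after_failure(placement, "node2")
```

Because every block has replicas on more than one node, the failure of any single node leaves the whole file readable in this model.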

MapReduce

MapReduce is the computation framework in Hadoop. It consists of two parts: the map operation and the reduce operation. MapReduce generates computation tasks and assigns them to the individual nodes, which perform the calculations where the data is stored. This avoids moving the data around the cluster. It also has built-in fault tolerance: when a node goes down during a calculation, a recovery strategy handles it. Higher-level tools on a Hadoop cluster, such as Hive or Pig, are converted into basic MapReduce tasks for execution.
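The two phases can be illustrated with the classic word-count example. The following is a minimal single-process sketch in plain Python, not Hadoop's API: the function names are made up for illustration, and the `shuffle` step stands in for the grouping that the framework performs between the map and reduce phases.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # On a real cluster, each line (or block) would be mapped on a node
    # that already stores it; here everything runs in one process.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["hadoop stores data", "hadoop processes data"])
```

Each map call only needs its own line, and each reduce call only needs the values for its own key, which is what lets the work be spread across many nodes.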

HBase

HBase originates from Google's Bigtable. It is a column-oriented database with high performance, scalability, and reliability. HBase data is stored in HDFS, although it can also use other file systems, such as S3. HBase is a top-level project and is used very frequently. For example, we can use it to store the page information collected by a web crawler. The specific HBase concepts are described in more detail later in this note. It offers low-latency access.
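As a rough picture of the column-oriented model, the sketch below stores each row as a map from "family:qualifier" column names to values and scans rows in sorted key order. It is a toy in-memory class, not the HBase client API; `TinyColumnStore`, its methods, and the crawler row keys are all hypothetical names chosen for illustration.

```python
from collections import defaultdict


class TinyColumnStore:
    """Toy in-memory table: row key -> "family:qualifier" -> value."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Write one cell; different rows may have different columns."""
        self._rows[row_key][column] = value

    def get(self, row_key, column=None):
        """Read one cell, or the whole row when no column is given."""
        row = self._rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)

    def scan(self, prefix=""):
        """Yield rows in sorted row-key order, filtered by key prefix."""
        for key in sorted(self._rows):
            if key.startswith(prefix):
                yield key, dict(self._rows[key])


# Hypothetical crawler data, keyed by reversed domain plus path.
pages = TinyColumnStore()
pages.put("com.example/index", "content:html", "<html>...</html>")
pages.put("com.example/index", "meta:status", "200")
pages.put("org.sample/home", "meta:status", "404")

example_keys = [key for key, _ in pages.scan("com.")]
```

Keeping rows sorted by key is what makes prefix scans like `pages.scan("com.")` cheap, which is why row-key design matters for crawler-style workloads.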

Hive

Hive is a query tool. HBase does not support SQL well, and Hive solves this kind of problem: manipulating HBase with SQL is much more pleasant. SQL statements written in Hive are in fact eventually turned into MapReduce programs. Of course, these queries cannot be compared with those of a relational database such as MySQL: a Hive query takes seconds or minutes, which is relatively long.
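To give a feel for how a SQL statement can end up as a MapReduce job, the sketch below hand-translates a hypothetical query, `SELECT status, COUNT(*) FROM access_log GROUP BY status`, into a map step and a reduce step. This is only a conceptual illustration in plain Python; the table, the function names, and the translation itself are assumptions, and Hive's real query planner is considerably more involved.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical table rows; in Hive these would live in files on HDFS.
access_log = [
    {"url": "/home", "status": 200},
    {"url": "/cart", "status": 500},
    {"url": "/home", "status": 200},
    {"url": "/about", "status": 404},
]


def map_row(row):
    """Map: the GROUP BY column becomes the key, COUNT(*) contributes a 1."""
    return row["status"], 1


def reduce_group(status, ones):
    """Reduce: COUNT(*) is the sum of the 1s emitted for this key."""
    return status, sum(ones)


def run_group_by_count(rows):
    # Sorting by key plays the role of the shuffle between map and reduce.
    mapped = sorted((map_row(row) for row in rows), key=itemgetter(0))
    return [
        reduce_group(key, [value for _, value in group])
        for key, group in groupby(mapped, key=itemgetter(0))
    ]


result = run_group_by_count(access_log)
```

The sort-and-group step between map and reduce is part of why such queries carry the batch-style latency mentioned above.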

Sqoop

Sqoop is an excellent data synchronization tool. With relational databases, we often run into scenarios where Oracle data must be imported into MySQL, or MySQL data into Oracle. Sqoop does a similar job: it can import data from relational databases such as Oracle and MySQL into HBase or HDFS, and it can also move data from HDFS or HBase back into MySQL or Oracle.

Flume

Flume is a distributed, reliable, fault-tolerant, and customizable log collection tool. A typical scenario: you have 100 servers and need to monitor how each one is running, so you use Flume to collect the logs from every server. Flume comes in two versions, Flume OG and Flume NG; NG is now basically the one in use.

Impala

Impala is a new query system led by Cloudera. It provides SQL semantics for querying petabytes of data stored in Hadoop's HDFS and HBase. The existing Hive system also provides SQL semantics, but because Hive uses the MapReduce engine underneath, it remains a batch system and struggles to support interactive queries. By contrast, Impala's biggest selling point is its speed. Impala is worth studying alongside Phoenix and Spark SQL.

Spark

Spark is an in-memory computing framework. MapReduce performs a lot of I/O operations, whereas Spark computes in memory; according to the official website, it is 10 times faster than Hadoop. Spark is a major current trend and is worth understanding.

ZooKeeper

ZooKeeper, literally "zoo keeper", is a distributed coordination service. Its main functions are unified naming, state synchronization, cluster management, and configuration synchronization. ZooKeeper is used by HBase, as well as by Hadoop 2.x.

Mahout

Mahout is a data mining library with a large number of built-in algorithms. It can be used for prediction, classification, clustering, and so on. The tools are powerful, but the technical bar for using them is high.

Pig

Pig is similar to Hive; the specific differences are left for the reader to look up. Pig can be used to build a data warehouse and to query and analyze the data in it. Pig also has its own query language, Pig Latin, which unfortunately is not SQL.

Ambari

Ambari is a management platform. It makes unified deployment of a cluster possible and is very convenient.
