Reprinted from http://www.weixuehao.com/archives/559
There are many great tools in the Hadoop framework that help us solve problems in our work.
Location of Hadoop
As the figure shows, the further to the right a tool sits, the more real-time it is; the further up it sits, the more it involves algorithms and similar techniques.
The further up and to the right you go, the hotter the tool ...
A brief introduction to the tools in the Hadoop framework
HDFS
HDFS (Hadoop Distributed File System) is Hadoop's distributed file system. It was developed by an expert engineer after Google published its paper on the Google File System (GFS). HDFS is built on top of a cluster and is suited to storing very large amounts of data, up to petabytes; it is highly scalable and highly fault tolerant. It is also the foundation of a Hadoop cluster: most of what the cluster holds lives on HDFS.
MapReduce
MapReduce is Hadoop's computation framework. It consists of two parts: the map operation and the reduce operation. MapReduce generates computation tasks and assigns them to the nodes, which then run the calculations. This avoids moving data around the cluster. It also has built-in fault tolerance: when a node goes down during a calculation, the framework has a policy for handling it. Higher-level tools on a Hadoop cluster, such as Hive or Pig, are converted into basic MapReduce tasks for execution.
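To make the two parts concrete, here is a minimal single-process sketch of the map and reduce operations, using word counting as the example. It is plain Python, not Hadoop's API: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and the grouping step that the framework performs between map and reduce is simulated with a dictionary.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key (simulates the step between map and reduce)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["hadoop stores data", "hadoop processes data"]
    print(word_count(sample))
```

On a real cluster, each map call would run on the node that holds its slice of the input, which is how moving the data is avoided; this sketch only shows the shape of the computation.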
HBase
HBase originates from Google's Bigtable. It is a column-oriented database with high performance, scalability, reliability, and low latency. HBase data is stored in HDFS, although it can also use other file systems, such as S3. HBase is a top-level project and is used very frequently; for example, we can use it to store the pages collected by a web crawler. HBase concepts are described in more detail later in these notes.
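As a rough picture of what "column-oriented" means for the crawler example, the sketch below models a table as row key -> "family:qualifier" column -> value, with scans returning rows in sorted row-key order. This is a toy in plain Python, not the HBase client API: the class TinyTable, its methods, and the row keys and column names are all hypothetical, and details such as cell versions are left out.

```python
from collections import defaultdict


class TinyTable:
    """Toy model of a column-oriented table: row key -> column -> value."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Store one cell; column is written as 'family:qualifier'."""
        self._rows[row_key][column] = value

    def get(self, row_key, column=None):
        """Return one cell, or the whole row as a dict when column is None."""
        row = self._rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)

    def scan(self, start, stop):
        """Return rows with start <= row key < stop, in sorted key order."""
        return [(key, dict(self._rows[key]))
                for key in sorted(self._rows) if start <= key < stop]


if __name__ == "__main__":
    pages = TinyTable()
    # Hypothetical row keys: reversed domain plus path, one row per crawled page.
    pages.put("com.example/a", "content:html", "<html>A</html>")
    pages.put("com.example/a", "meta:status", "200")
    pages.put("org.sample/b", "content:html", "<html>B</html>")
    print(pages.get("com.example/a", "meta:status"))
    print([key for key, _ in pages.scan("com.", "d")])
```

Rows do not need to share the same columns, which suits crawled pages where the available metadata varies from page to page.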
Hive
Hive is a query tool. HBase's support for SQL is not very good, and Hive solves this kind of problem: manipulating HBase with SQL is much more pleasant. SQL statements written in Hive are in fact eventually turned into MapReduce programs. Of course, these queries cannot compete with a relational database such as MySQL: a Hive query takes seconds or even minutes, which is relatively long.
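The idea that a SQL statement ends up as a MapReduce program can be illustrated with a GROUP BY query. In the sketch below, SQLite (from Python's standard library) stands in for the SQL side, and a hand-written map/group/reduce function computes the same answer. This is not Hive and does not show how Hive actually plans queries; the table name visits, its columns, and both function names are made up for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (page, hits) records.
ROWS = [("page_a", 3), ("page_b", 5), ("page_a", 4)]


def sql_totals(rows):
    """Compute total hits per page with a SQL GROUP BY (SQLite as a stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (page TEXT, hits INTEGER)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT page, SUM(hits) FROM visits GROUP BY page"
    ).fetchall()
    conn.close()
    return dict(result)


def mapreduce_totals(rows):
    """Compute the same totals as explicit map, group, and reduce steps."""
    groups = defaultdict(list)
    for page, hits in rows:          # map: emit (page, hits)
        groups[page].append(hits)    # group the emitted values by key
    # reduce: sum the values collected for each key
    return {page: sum(values) for page, values in groups.items()}


if __name__ == "__main__":
    print(sql_totals(ROWS))
    print(mapreduce_totals(ROWS))
```

The GROUP BY column plays the role of the map output key and the aggregate function plays the role of the reducer, which is why this kind of query maps naturally onto MapReduce.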
Sqoop
Sqoop is a great data synchronization tool. With relational databases, we often run into scenarios where Oracle data has to be imported into MySQL, or MySQL data into Oracle. Sqoop does something similar: it can import data from relational databases such as Oracle and MySQL into HBase or HDFS, and it can also move data from HDFS or HBase back into MySQL or Oracle.
Flume
Flume is a distributed, reliable, fault-tolerant, and customizable log collection tool. A typical scenario: you have 100 servers and need to monitor how each one is running, so you use Flume to collect the logs from every server. There are two versions of Flume, Flume OG and Flume NG. Nowadays it is basically NG that is used.
Impala
Impala is a new query system led by Cloudera. It provides SQL semantics for querying petabyte-scale data stored in Hadoop's HDFS and HBase. The existing Hive system also provides SQL semantics, but because Hive uses the MapReduce engine underneath, it remains a batch process and struggles to support interactive queries. By contrast, Impala's biggest selling point is its speed. For comparison with Impala, look into Phoenix and Spark SQL.
Spark
Spark is an in-memory computing framework and a major trend at the moment. MapReduce performs a lot of I/O operations, whereas Spark computes in memory. It is 10 times faster than Hadoop (according to the official website). Spark is a current trend that is worth understanding.
ZooKeeper
ZooKeeper literally means an animal keeper. It is a distributed coordination service. Its main functions are unified naming, state synchronization, cluster management, and configuration synchronization. ZooKeeper is used in HBase, as well as in Hadoop 2.x.
Mahout
Mahout is a data mining library with a large number of built-in algorithms. It can be used for prediction, classification, clustering, and so on. The tools are powerful, but the technical bar is high.
Pig
Pig is similar to Hive; look up the specific differences yourself. Pig can be used to build a data warehouse and to query and analyze the data in it. Pig also has its own query syntax, Pig Latin, which unfortunately is not SQL.
Ambari
Ambari is a management platform. It makes unified deployment of a cluster possible and is very convenient.