Hadoop in the Big Data era (vi): The Hadoop ecosystem (Pig, Hive, HBase, ZooKeeper, Sqoop)


Hadoop is a distributed system infrastructure developed by the Apache Foundation that provides two main functions: distributed storage and distributed computing. Distributed storage is the basis of distributed computing. Hadoop defines a distributed storage interface and ships its own implementation of it, HDFS, but that does not mean Hadoop only supports HDFS: it also supports other storage systems and can run its distributed computing programs (MapReduce) on top of them.


From a developer's point of view, Hadoop reserves two interfaces: the map interface and the reduce interface. The overall processing flow is fixed; what the user does is implement a map function and a reduce function suited to the needs of the specific project, and thereby achieve the goal.
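As a concrete illustration, here is a minimal Java sketch (not from the original article) of what implementing those two interfaces looks like. It computes the maximum temperature per year, the same task the Pig example below solves in a few lines; the input format of one "year temperature" pair per line is an assumption made for this example.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperature {

    // Map: parse each input line into a (year, temperature) pair.
    public static class MaxTempMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumed input format: "year temperature", whitespace-separated.
            String[] parts = value.toString().trim().split("\\s+");
            context.write(new Text(parts[0]),
                    new IntWritable(Integer.parseInt(parts[1])));
        }
    }

    // Reduce: for each year, keep the highest temperature seen.
    public static class MaxTempReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text year, Iterable<IntWritable> temps, Context context)
                throws IOException, InterruptedException {
            int max = Integer.MIN_VALUE;
            for (IntWritable t : temps) {
                max = Math.max(max, t.get());
            }
            context.write(year, new IntWritable(max));
        }
    }
}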


Learning Hadoop as a whole is rather difficult, but there are already open-source tools that do much of the work for us, such as Pig, Hive, and HBase. The focus of this section is to introduce some of these open-source tools built on top of Hadoop, which together can be called the Hadoop ecosystem.

1. Pig

Pigs eat anything!


Pig was invented at Yahoo! to make it easier for researchers and engineers to mine large datasets.


Pig provides a higher level of abstraction for processing large datasets. MapReduce lets programmers define their own map and reduce functions, but data processing often requires a chain of several MapReduce passes, so recasting a data-processing task into the MapReduce pattern is complex.


Compared to MapReduce, Pig provides richer data structures and a powerful set of data transformation operations.


Pig consists of two parts:
The language used to describe data flows, called Pig Latin.
The execution environment that runs Pig Latin programs, which comes in two modes: a local environment in a single JVM and a distributed execution environment on a Hadoop cluster.

A Pig Latin program consists of a series of "operations" and "transformations". Each operation or transformation processes its input and produces output. Taken together, these operations describe a data flow. The Pig execution environment translates the data flow into an executable internal representation and runs it.


Example:

-- load the data, using the schema given after "as"
records = load '/home/user/input/temperature1.txt' as (year:chararray, temperature:int);
-- print the records relation and its schema
dump records;
describe records;
-- keep only the rows whose temperature is not 999
valid_records = filter records by temperature != 999;
-- group by year
grouped_records = group valid_records by year;
dump grouped_records;
describe grouped_records;
-- take the maximum temperature per year
max_temperature = foreach grouped_records generate group, MAX(valid_records.temperature);
-- note: within grouped_records, valid_records is the name of the bag field;
-- its structure can be seen in the output of the describe statement above.
dump max_temperature;


Compared to traditional databases:

Pig Latin is a data-flow programming language, whereas SQL is a declarative programming language.
Pig does not support transactions or indexes, and it does not support low-latency queries.

2. Hive

Hive is a data warehouse framework built on Hadoop, designed to let analysts proficient in SQL query Facebook's large-scale datasets stored in HDFS.

Hive transforms a query into a series of MapReduce jobs that run on the Hadoop cluster. Hive organizes data into tables, and in this way imposes structure on data stored in HDFS. Metadata, such as table schemas, is stored in a database called the metastore.


Example:

-- (1) create a table
CREATE TABLE csdn (username STRING, passw STRING, mailaddr STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '#';
-- (2) load a local file into the csdn table
LOAD DATA LOCAL INPATH '/home/development/csdnfile' OVERWRITE INTO TABLE csdn;
-- (3) run a query and write the result to a local directory
INSERT OVERWRITE LOCAL DIRECTORY '/home/development/csdntop' SELECT passw, COUNT(*) AS passwdnum FROM csdn GROUP BY passw ORDER BY passwdnum DESC;


Compared to traditional databases:


Hive sits between Pig and a traditional RDBMS, and its query language, HiveQL, is based on SQL.
Hive does not validate data while it is being loaded but at query time; this is called "schema on read", whereas a traditional database uses "schema on write".
Like Pig, Hive does not support transactions or indexes, and it does not support low-latency queries.

3. HBase

HBase is a column-oriented distributed database built on top of HDFS that supports real-time random reads and writes of very large datasets.


HBase can be thought of as a huge sparse table that we can manage on a cluster of inexpensive hardware.


Data model


In HBase, table cells are versioned, the contents of a cell are a byte array, and the row key of each row in the table is also a byte array. All access to a table goes through its row key.


A table's column families must be specified up front as part of the table schema definition, but new columns within a column family can be added later as needed.

HBase is column-family oriented: physical storage is organized by column family. HBase partitions tables horizontally into regions, each region holding a subset of the table's rows.


In short, HBase is a distributed, column-oriented data storage system.
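To make the data model concrete, here is a minimal sketch using the HBase Java client (the table name "temperatures" and column family "data" are assumptions for this example; the table must already exist). Note how the row key, column family, qualifier, and cell value are all byte arrays, and how access goes through the row key:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("temperatures"))) {
            // Write one cell: everything is a byte array.
            Put put = new Put(Bytes.toBytes("1950"));   // row key
            put.addColumn(Bytes.toBytes("data"),        // column family
                    Bytes.toBytes("max"),               // column qualifier
                    Bytes.toBytes("22"));               // cell value
            table.put(put);

            // Read the cell back through the row key.
            Result result = table.get(new Get(Bytes.toBytes("1950")));
            byte[] value = result.getValue(Bytes.toBytes("data"), Bytes.toBytes("max"));
            System.out.println(Bytes.toString(value));
        }
    }
}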


4. ZooKeeper

ZooKeeper is Hadoop's distributed coordination service, and it originated at Yahoo!.


ZooKeeper provides a set of tools that allow us to handle partial failure when building distributed applications.
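As a small illustration, here is a sketch of the ZooKeeper Java API (the server address and znode paths are assumptions for this example). An ephemeral znode is deleted automatically when the client that created it fails or disconnects, which is a basic building block for detecting partial failure:

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
    public static void main(String[] args) throws Exception {
        // Connect; the Watcher callback receives session events, and we block
        // until the session is actually established.
        final CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Make sure the persistent parent node exists.
        if (zk.exists("/workers", false) == null) {
            zk.create("/workers", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // An ephemeral node disappears if this client dies, signalling the
        // failure to any other client watching it.
        zk.create("/workers/worker-1", "alive".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        byte[] data = zk.getData("/workers/worker-1", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}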

5. Sqoop

Sqoop is an open-source tool that allows users to extract data from a relational database into Hadoop for further processing. The extracted data can be consumed by MapReduce programs or by tools such as Hive. Once analysis results have been produced, Sqoop can export them back into the database for use by other clients.
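As a hedged sketch of that round trip: Sqoop is normally driven from the command line, but Sqoop 1 also exposes a Java entry point, org.apache.sqoop.Sqoop.runTool, which takes the same arguments. The JDBC URL, credentials, table name, and target directory below are placeholders invented for this example:

import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
    public static void main(String[] args) {
        // Import a relational table into HDFS, where MapReduce or Hive can process it.
        int ret = Sqoop.runTool(new String[] {
                "import",
                "--connect", "jdbc:mysql://dbhost/testdb",  // placeholder JDBC URL
                "--username", "user",
                "--password", "secret",
                "--table", "csdn",                          // source table
                "--target-dir", "/user/hadoop/csdn"         // HDFS destination
        });
        System.exit(ret);
    }
}

The reverse direction, sending results back into the database, uses the same mechanism with the "export" tool.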

