Summary of some statements of HQL

Source: Internet
Author: User
Tags: integer numbers, hadoop ecosystem, sqoop

HQL content originally from teacher slaytanic: http://slaytanic.blog.51cto.com/2057708/782175/

The introduction to Hadoop is from teacher shishanyuan: http://www.cnblogs.com/shishanyuan/p/4629300.html. His Hadoop blog is very well written; anyone interested should go take a look.

Hive is a Hadoop-based query tool open-sourced by Facebook, which means that if you want to use Hive, you need to install Hadoop first.

The ecosystem of Hadoop is roughly as follows:

• HDFS -- the basic component of the Hadoop ecosystem is the Hadoop Distributed File System (HDFS). HDFS is a mechanism for storing data in a distributed way: the data is saved across a computer cluster, and HDFS provides the foundation for tools such as HBase.

• MapReduce -- Hadoop's main execution framework is MapReduce, a distributed, parallel-processing programming model that divides a task into a map phase and a reduce phase. Because of how MapReduce works, Hadoop can access data in parallel, which makes access to large datasets fast.

• HBase -- HBase is a column-oriented NoSQL database built on HDFS for fast reading/writing of large amounts of data. HBase uses ZooKeeper for management to ensure that all of its components keep running properly.

• ZooKeeper -- the distributed coordination service for Hadoop. Many Hadoop components depend on ZooKeeper, which runs on top of the computer cluster and manages Hadoop operations.

• Pig -- an abstraction over the complexity of MapReduce programming. The Pig platform includes a runtime environment and a scripting language for analyzing Hadoop datasets (Pig Latin). Its compiler translates Pig Latin into sequences of MapReduce programs.

• Hive -- Hive is an SQL-like high-level language for running queries over data stored on Hadoop; it lets developers who are unfamiliar with MapReduce write data queries, which are then translated into MapReduce jobs on Hadoop. Like Pig, Hive is an abstraction-layer tool and attracts many data analysts who know SQL rather than Java programming.

• Sqoop -- a connectivity tool for moving data between relational databases or data warehouses and Hadoop. Sqoop uses the database's schema to describe the data to be imported/exported, and uses MapReduce for parallel operation and fault tolerance.

• Flume -- provides a distributed, reliable, and efficient service for collecting, aggregating, and moving large amounts of data from individual machines into HDFS. It is based on a simple and flexible streaming architecture and uses a simple, extensible data model to move data from many machines across an enterprise into Hadoop.

For example, at our company we currently use Flume to pull back the data stored on X servers. During the pull, the TXT files from each server for each time period (one period every 10 minutes) are merged into a file named yyyy-mm-dd_hh_mm.tmp. The pulled data lands in a flume table, the suffix is changed to .log, and the batch is then placed into a data table; each batch of data is further split into two tables (computed in Hive). Those two tables go through various calculations and join operations to produce reports, which are pushed to SQL Server for the front end to render. (I wrote this paragraph rather badly; this is the current data-processing flow, but I seem to have left things out and don't know it that well. I still need to learn -- after all, I'm not even fully clear on my own company's process, and I'm not a data-mining engineer.)
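(Just to make the "join the two tables into a report" step concrete, here is a minimal HiveQL sketch; the table names flumetable_a, flumetable_b, report_table and the columns are made up for illustration, not our actual schema.)

INSERT OVERWRITE TABLE report_table
SELECT a.user_id,
       COUNT(*)        AS event_cnt,      -- how many events per user
       SUM(b.duration) AS total_duration  -- aggregated from the second table
FROM flumetable_a a
JOIN flumetable_b b ON a.user_id = b.user_id
GROUP BY a.user_id;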

OK, on to the content:

1. Field data types.

In Hive, setting the data type of a table field is really done for data-mining purposes, and you can also build an index on columns that frequently appear in WHERE clauses.
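(A rough sketch of the index idea, assuming an older Hive version that still supports indexes -- they were removed in Hive 3.0 -- and a hypothetical table and column:)

-- create a compact index on a column often used in WHERE, then build it
CREATE INDEX idx_table1_column2
ON TABLE database.table1 (column2)
AS 'COMPACT'
WITH DEFERRED REBUILD;

ALTER INDEX idx_table1_column2 ON database.table1 REBUILD;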

There are several types of data:

STRING: variable-length string
TINYINT: 1-byte integer (up to 3 decimal digits)
SMALLINT: 2-byte integer (up to 5 decimal digits)
INT: 4-byte integer (up to 10 decimal digits)
BIGINT: 8-byte integer (up to 19 decimal digits)
FLOAT: single-precision floating-point number
DOUBLE: double-precision floating-point number
BOOLEAN: boolean, i.e. true or false

The different integer types have different size limits, which you need to pay attention to when creating a table, so that data is not truncated because the type has too few digits. On the other hand, a type far larger than necessary wastes storage space.

There are three other, less frequently used types:
STRUCT: a structure
ARRAY: an array
MAP: a key/value map (I'm not sure how best to translate this one)
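(A minimal sketch that pulls these types together; the table name, columns, and delimiters are made up for illustration:)

CREATE TABLE database.type_demo
(
  name    STRING,
  age     TINYINT,
  score   DOUBLE,
  active  BOOLEAN,
  tags    ARRAY<STRING>,
  props   MAP<STRING, STRING>,
  address STRUCT<city:STRING, zip:INT>
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
  MAP KEYS TERMINATED BY ':';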

2. Creating a data table.

Hive data tables come in two kinds: internal tables and external tables.

An internal table is a table that Hive creates and into which data is imported via LOAD DATA INPATH; it can be understood as a table whose data is saved together with its table structure. When you delete the table structure from the metadata with DROP TABLE table_name, the table's data is deleted from HDFS as well.

An external table refers to data that was already saved in HDFS before the table structure was created; creating the table structure simply maps that data onto the table's schema. When you DROP TABLE table_name, Hive only removes the table structure from the metadata and does not delete the files on HDFS, so external tables can be used with more peace of mind than internal tables.

I had never understood the difference between internal and external tables, but after the teacher explained it, it clicked right away. With an internal table, the structure comes first and then the data; with an external table, the data comes first and a structure is then created to hold it. When we DROP TABLE, an internal table is deleted together with its data, while for an external table only the structure is deleted and the data is still there -- as the saying goes, as long as the green hills remain, there will be firewood to burn.

3. Internal table creation statement:

CREATE TABLE database.table1
(
  column1 STRING COMMENT 'comment1',
  column2 INT COMMENT 'comment2'
);

4. External table creation statement:

Here is how to create an external table when the files in HDFS are not LZO-compressed and are saved as plain text:

CREATE EXTERNAL TABLE IF NOT EXISTS database.table1
(
  column1 STRING COMMENT 'comment1',
  column2 STRING COMMENT 'comment2'
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 'hdfs:///data/dw/asf/20120201';

Here is how to create an external table when the files in HDFS are LZO-compressed; of course, you need hadoop-gpl support to be able to read the LZO files as text.

CREATE EXTERNAL TABLE IF NOT EXISTS database.table1
(
  column1 STRING COMMENT 'comment1',
  column2 STRING COMMENT 'comment2'
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
LOCATION 'hdfs:///data/dw/asf/20120201';

The STORED AS INPUTFORMAT/OUTPUTFORMAT part (shown in red in the original post) is really painful: almost every article about it on the internet is a copy-paste of the same text, and without exception they all write it incorrectly. If the Chinese Hive material you find is not this article by teacher slaytanic, creating an external table will most likely throw an error.

-- I usually use internal tables more, so I don't have much to say about external tables. Another question: why did the teacher not specify a delimiter when creating the internal table? Is it really fine not to specify one?
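(A note on that question: if no ROW FORMAT is specified, Hive falls back to its default field delimiter, the control character \001 (Ctrl-A). If your text files are tab-separated, you can say so explicitly when creating an internal table; a minimal sketch with made-up names:)

CREATE TABLE IF NOT EXISTS database.table2
(
  column1 STRING COMMENT 'comment1',
  column2 INT COMMENT 'comment2'
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';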

5. Loading data into internal tables

• Load data from the local file system:

LOAD DATA LOCAL INPATH 'local path of the file' OVERWRITE INTO TABLE basedatadb.table1;

If OVERWRITE is added, the existing data in the table will be overwritten.

• Load data from HDFS:

LOAD DATA INPATH 'path of the file in HDFS' INTO TABLE basedatadb.table1;

I haven't used this one myself.
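(A small sketch of both load variants with hypothetical file paths; OVERWRITE replaces the existing data, while plain INTO appends to it:)

-- from the local file system, replacing existing data
LOAD DATA LOCAL INPATH '/data/logs/2015-07-26.log' OVERWRITE INTO TABLE basedatadb.table1;

-- from HDFS, appending to existing data
LOAD DATA INPATH '/data/dw/asf/2015-07-26.log' INTO TABLE basedatadb.table1;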

6. Partitions

• Add a partition:

ALTER TABLE basedatadb.table1 ADD PARTITION (day='2015-07-26');

• Drop a partition:

ALTER TABLE basedatadb.table1 DROP PARTITION (day='2015-07-26');
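(Note that ADD/DROP PARTITION only works on a table created with PARTITIONED BY. A minimal sketch with hypothetical columns, plus a query that reads a single day's partition:)

CREATE TABLE IF NOT EXISTS basedatadb.table1
(
  column1 STRING COMMENT 'comment1',
  column2 INT COMMENT 'comment2'
)
PARTITIONED BY (day STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- only scans the files under the day='2015-07-26' partition
SELECT column1, column2
FROM basedatadb.table1
WHERE day = '2015-07-26';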

To be continued: later I will add how to modify the table structure, add new columns, and so on ~~~. I didn't expect writing a blog post to take so much time -- I had originally planned half an hour for two posts /(ㄒoㄒ)/~~. In fact, this can hardly count as my own work; I was mostly copying the teacher's summary and then adding my own questions and notes. But I will keep working hard, and I hope that one day I can write original posts of my own to share.
