Foreword
I am still in awe of technology.
Hadoop overview
Hadoop is an open-source distributed computing platform based on the Map/Reduce model, an offline analysis tool for processing massive amounts of data. It is developed in Java and built on HDFS. The underlying ideas were first proposed by Google; interested readers can study Google's "troika" of papers: GFS, MapReduce, and Bigtable. I will not go into detail here, since there is plenty of material online.
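To make the Map/Reduce model concrete, here is the classic word-count example written against Hadoop's Java MapReduce API. It is only a minimal sketch: the input and output HDFS paths come from the command line, and the class name is illustrative.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map phase: emit (word, 1) for every token in the input split.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce phase: sum the counts collected for each word.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The map phase emits (word, 1) pairs in parallel across the cluster, and the reduce phase sums the counts per word; that division of labor is the heart of the Map/Reduce model.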
The Hadoop project is structured as follows:
The most important parts of Hadoop are HDFS and MapReduce. Starting with HDFS:
HDFS has the following main advantages:
1) Support for very large files. In general, a Hadoop file system can easily store data at the TB and even PB level.
2) Detection of and quick recovery from hardware failures. Clusters are built on large amounts of common, inexpensive hardware, so hardware faults are the norm rather than the exception. A single HDFS deployment may consist of hundreds of servers storing data files, and more servers also means a higher failure rate, so fault detection and automatic recovery are an explicit design goal of HDFS.
3) Streaming data access. The data sets HDFS handles are large, and applications need to access large amounts of data at a time; HDFS suits batch processing rather than interactive use. It provides streaming access to data, and the focus is on high data throughput rather than low access latency. HDFS is built around the idea that the most efficient data-processing pattern is write-once, read-many-times: once a file has been written and closed, HDFS does not support updating its contents. A stored data set serves as the object of Hadoop analysis; after the data set is generated, various analyses are run on it over a long period, and each analysis involves most or all of the data, so reading the entire data set efficiently matters more than reading the first record quickly. (Streaming reads minimize the drive's seek overhead: the disk seeks once and then reads continuously. Since each file is split into blocks, 64 MB by default, and the physical construction of a hard disk means seek performance cannot keep up with sequential read performance, streaming reads fit the characteristics of the disk itself, and large files in turn fit streaming reads. The opposite of streaming access is random access, which requires low latency for locating, querying, or modifying data; that is the pattern traditional relational databases are designed for.)
4) A simplified consistency model. Most HDFS files are written once and read many times; once a file has been created, written, and closed, it generally does not need to be modified. This simple consistency model helps provide a high-throughput data access model. A small sketch of the write-once, read-many pattern follows this list.
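Here is that sketch, using HDFS's Java FileSystem API. The NameNode address and the file path are assumptions for the example; adjust them for a real cluster.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteOnceReadMany {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000"); // assumed NameNode address
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/demo/log.txt"); // illustrative path

        // Write once: create the file, write, close. HDFS does not
        // support updating the contents in place afterwards.
        try (FSDataOutputStream out = fs.create(path)) {
          out.write("one line of log data\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read many times: open the file and stream it sequentially
        // from the beginning, which is the access pattern HDFS favors.
        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
          String line;
          while ((line = reader.readLine()) != null) {
            System.out.println(line);
          }
        }
        fs.close();
      }
    }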
Looking back at these design goals, we find that what counts as an advantage in some scenarios becomes a limitation in others. The main points:
1) Not suitable for low-latency data access. HDFS is designed to handle large-scale data and aims primarily at high data throughput; it sacrifices low latency for that throughput, so we cannot expect to read individual pieces of data from HDFS quickly.
For low-latency or real-time access to data on Hadoop, HBase is a good solution; note, though, that HBase is a NoSQL, column-oriented database.
2) Inefficient storage of large numbers of small files. In HDFS, the NameNode (master) node manages the file system's metadata and answers client requests with file locations, so the number of files a cluster can hold is limited by the NameNode's memory size. In addition, a large number of small files means more disk seeks and more overhead opening and closing files for each unit of data read, which greatly reduces data throughput. Both effects run against HDFS's design goals and also put greater pressure on the NameNode.
3) No support for multiple writers or arbitrary modification of files. Only one writer at a time may write to a given file in HDFS, and data can only be appended to the end of the file; when reading, data can only be read sequentially from the beginning of the file.
At present, HDFS supports neither concurrent writes by multiple users to the same file nor random access for modifying data.
The HDFS architecture is as follows:
Hive data management
Well, with the above finished, let's get to the main thread: Hive, the main subject of this preliminary study. As the figure above showed, Hive is Hadoop's data warehouse infrastructure. How to describe that more vividly? Put plainly:
1 It does not store data itself; the real data is stored on HDFS.
2 It provides a SQL-like language as an external interface for storing, querying, and analyzing data; in short, much more comfortable than operating on the data in Hadoop directly. A small sketch of this interface follows.
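To show what the SQL-like interface looks like in practice, here is a minimal sketch that connects to HiveServer2 over JDBC and runs a HiveQL aggregation. The connection URL, the empty credentials, and the weblogs table are assumptions for the example.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryExample {
      public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver (hive-jdbc must be on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Assumed HiveServer2 address and anonymous login.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement();
             // HiveQL looks like SQL, but Hive compiles it into
             // MapReduce jobs that run over files stored in HDFS.
             ResultSet rs = stmt.executeQuery(
                 "SELECT ip, COUNT(*) AS hits FROM weblogs GROUP BY ip")) {
          while (rs.next()) {
            System.out.println(rs.getString("ip") + "\t" + rs.getLong("hits"));
          }
        }
      }
    }

Since the query is compiled into MapReduce jobs, even a small query carries noticeable latency, which ties into the data storage discussion below.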
Next, let's analyze Hive's data management from the following aspects:
1 Metadata storage
Hive stores its metadata in an RDBMS, which is why guides to installing Hive usually tell you to install MySQL or another RDBMS first. For readers unfamiliar with the term, metadata can be understood as the basic descriptive information about the data; it does not contain the actual data itself.
2 Data storage
Hive has no special data storage format of its own and does not index the data; all of the data lives in HDFS. Since Hadoop is a batch system, tasks have high latency, and submitting and scheduling jobs takes time, so even when Hive processes a very small data set there is noticeable delay during execution; Hive's performance therefore cannot be compared with a traditional database such as Oracle. In addition, Hive provides no data sorting or query caching, no online transaction processing, no real-time queries, and no record-level updates; in other words, it is not meant to face users directly but is suited to offline processing. Hive works best for batch jobs over large, unchanging data sets (such as web logs), and its greatest value lies in scalability, extensibility, good fault tolerance, and a loosely constrained data input format.
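As a sketch of how Hive lays a table over files that already live in HDFS, the example below issues a CREATE EXTERNAL TABLE statement through the same JDBC interface as before. The column layout and the /data/weblogs location are assumptions for a hypothetical web log.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class HiveExternalTableExample {
      public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement()) {
          // The table definition is pure metadata, stored in the RDBMS;
          // the rows themselves are just tab-separated files in HDFS.
          stmt.execute(
              "CREATE EXTERNAL TABLE IF NOT EXISTS weblogs ("
              + "  ip STRING, ts STRING, url STRING)"
              + " ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'"
              + " LOCATION '/data/weblogs'"); // assumed HDFS directory
        }
      }
    }

Because the table is EXTERNAL, dropping it later removes only the metadata; the weblog files stay untouched in HDFS.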
The Hive architecture is as follows:
Thrift, by the way, is a communication framework similar to ICE.
After the introduction above, I believe the applicable scenarios for Hadoop and Hive are clear at a glance; which you actually need depends on the specific business scenario ~
If there is a next installment, it will cover building a Hadoop cluster and a walkthrough of Hive examples.
Original link: http://www.cnblogs.com/HouZhiHouJueBlogs/p/3963848.html