There are many broad areas of application for SQL on the Hadoop engine:
- Data processing and on-Line Analytical processing (OLAP)
- Improved optimization
- Online transaction processing (OLTP)
Storage Engine:
Today there are three main storage engines for Hadoop: Apache HBase, Apache Hadoop HDFS, and Hadoop Accumulo. Apache Accumlo is very similar to Hbase, but it is a project created by the NSA organization, historically with a particular emphasis on system security, especially in the area of authorization authentication; Hbase has now added security features to the project, in our view, In this case, the Accumulo is no longer discussed further.
HBase is a distributed key-value storage system, where data is stored in a well-ordered map form, which means that the data is sorted by the key column, as we will describe below, the hbase typical use case is OLTP application, HDFS is a file system and is able to store a large amount of data collection in a distributed manner;
HBase stores data stored in HDFS in hfile format, which is not configurable. When using HDFS directly without HBase, the user is required to select a file format;
There are a number of points to consider when choosing a file format, for example,
- What is the main reading mode? Read all rows, or only a subset of all rows of data;
- Is the data also likely to contain non-ASCII text content?
- Which tools read and write this data (Apache Hive, Spark?);
Broadly speaking, there are two file formats for use with HDFS, which are columnar and row-wise. Columnar formats such as Rcfile, ORC, and Apache parquet can provide extreme compression performance (compression via a variety of coding methods like stroke encoding), while providing high read performance in scenarios where only a small number of columns in the row are read , such as your row of data has 50 to 100 columns, but only need to read seven or eight columns of the occasion;
Row-wise formats, such as restricted text formats, Sequence file formats, and Apache Avro formats, which do not provide effective compression, are more suitable for business scenarios that require reading most of the columns in a table, and that data is streamed in a way that A business scenario that leads to a table in small batches at a time;
We recommend excluding text formats, rcfile and Sequence files, because they are historically legacy file formats, and are not recommended because they have potential anomalies when integrating historical system data. We do not recommend using these formats because they are prone to text collisions (such as non-ASCII text), poor performance, and few tools other than text to read them;
Once we answered the question of choosing columnar or row-wise, and ruled out the file formats left over from history, the most important question became which tools and engines were able to read and write the data, and a large number of Hadoop eco-chain tools and engines had integrated Avro and parquet Project, where ORC is the best-performing Apache Hive file format.
Online transaction processing
Apache HBase Projects provide OLTP-type operations and are highly scalable, and HBase is the only Hadoop submodule that is typically used for online user data storage, but the HBase project is not intended to be a relational database management system, nor is it intended to replace MySQL, Oracle or DB2 This kind of relational database, in fact HBase itself does not provide any SQL interface, and users must also use Java, Python, or Ruby programming to store and retrieve data;
Apache Phoenix Project goal is to provide OLTP type based on Apache HBase Sql,phoenix allows users to perform insert update and delete operations based on the HBase data model, but as mentioned earlier, the HBase data model is fundamentally different from the traditional off- System database, so HBase plus Phoenix is still not a replacement for relational databases;
HBase (and Phoenix) projects are useful for business scenarios that are based on RDBMS and have trouble applying extensions, and a traditional solution for traditional relational databases is horizontal partitioning, but this solution often suffers from the following pitfalls:
- Cross-shard transactions are not supported
- A complex and expensive re-sharding process is required to increase the level of machine expansion.
Like a partitioned database, HBase does not support transactions, but it is much easier for the hbase system to increase the level of machine scaling and load rebalancing within hbase.
New nodes can be added to the HBase cluster, HBase can automatically allocate data shards to different nodes, and HBase wins with the ability to provide easy-to-increase machine-level scalability if both the Shard database and HBase are assumed to be missing, and some companies are already doing the underlying use of HBase The architecture is based on the upper layer to add transactional SQL support products, such as Splice Machine Company.
Storage engine and online transaction processing for Hadoop systems