Comparison of HBase and Oracle


Reposted from: http://www.cnblogs.com/chay1227/archive/2013/03/17/2964020.html

Reposted from: http://blog.csdn.net/allen879/article/details/40461227

Reposted from: http://blog.itpub.net/28912557/viewspace-776770/

Because of project requirements, an upgrade of our existing system had to adopt HBase, and after using it I found that it works really well. That raises the question: why use HBase here instead of Oracle, the relational database we used before? What are the characteristics of each, and which application scenarios suit them? It is better to study with questions in mind.

First look at the comparison between relational databases and NoSQL:

A relational database represents all data in a two-dimensional form of rows and columns.

Strengths of a relational database:

1. It maintains data consistency (transaction processing)

2. Because the data is normalized, the cost of updating it is small (the same field basically exists in only one place)

3. It can run complex queries such as joins

The greatest of these strengths is the ability to maintain data consistency.

Weaknesses of a relational database:

It is not good at handling:

1. Write processing of large amounts of data

2. Index or table structure (schema) changes on tables that are being updated

3. Applications whose fields are not fixed

4. Simple queries that need to return results quickly

--Write processing of large amounts of data

When reads and writes are concentrated on a single database, it becomes overwhelmed, so most sites use master-slave replication to separate reads from writes. This improves read and write performance and lets the read databases scale out.

In master-slave mode, which is typically used under heavy data load, the master database handles writes and the slave databases handle reads. Reads can be scaled fairly simply by adding more slaves, but there is no equally simple way to scale writes.

First, to scale writes you can consider going from one master database to two, used as a dual-master pair that replicates to each other. This can reduce the load on each master, but updates may conflict and leave the data inconsistent. To avoid this, requests for each table need to be routed to the master responsible for that table.

Second, the database can be split and placed on different database servers, for example by putting different tables on different servers. Splitting reduces the amount of data on each server, which cuts disk I/O and allows more processing to happen at memory speed. However, tables stored on different servers cannot be joined, so these issues must be considered before the split. If a join is still needed after the split, the association has to be done in application code, which is difficult.
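To make the last point concrete, here is a minimal sketch of a join done in application code after two tables have been split onto separate servers. Two in-memory SQLite databases stand in for the two servers; the table names, columns, and the `orders_with_user_names` helper are assumptions made up for this illustration.

```python
import sqlite3

# Two independent in-memory databases stand in for two database servers.
users_db = sqlite3.connect(":memory:")
orders_db = sqlite3.connect(":memory:")

users_db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
users_db.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])

orders_db.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL)"
)
orders_db.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 9.5), (11, 2, 20.0), (12, 1, 3.0)],
)


def orders_with_user_names():
    """Join orders to users in application code, because a SQL JOIN
    cannot span the two separate databases."""
    orders = orders_db.execute(
        "SELECT id, user_id, amount FROM orders ORDER BY id"
    ).fetchall()
    user_ids = sorted({user_id for _, user_id, _ in orders})
    if not user_ids:
        return []
    # Second round trip: fetch only the users referenced by the orders.
    placeholders = ",".join("?" for _ in user_ids)
    names = dict(
        users_db.execute(
            f"SELECT id, name FROM users WHERE id IN ({placeholders})", user_ids
        ).fetchall()
    )
    return [(order_id, names.get(user_id), amount) for order_id, user_id, amount in orders]
```

Even in this tiny case the application needs two queries and its own matching logic, which is the extra work the text warns about.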

--Index or table structure changes on tables that are being updated

With a relational database you create indexes to speed up queries and change the table structure to add the fields you need. To perform these operations the table must be locked with a shared lock, and during that time data changes (updates, inserts, deletes, and so on) are not possible. For time-consuming operations, such as creating an index on a table with a large amount of data or altering its structure, pay special attention to the fact that the data may not be updatable for a long time.

--Applications whose fields are not fixed

If the fields are not fixed, a relational database is also hard to use. Some will say you can simply add a field when you need one. That is not impossible, but in practice changing the table structure again and again is very painful. You can also reserve a large number of spare fields in advance, but then it is easy to lose track of the mapping between fields and data, that is, which field holds which data.

--Simple queries that need to return results quickly ("simple" here means there are no complex query conditions)

This is not so much a weakness, but a relational database is not good at returning the results of simple queries quickly. Because data is read through SQL, the database has to parse the SQL statement and pays additional overhead such as locking and unlocking tables. This is not to say that relational databases are too slow, only that if you want to handle simple queries quickly, you do not necessarily have to use a relational database.

---------------------------

NoSQL Database

Relational databases are widely used and can perform complex processing such as transactions and joins. In contrast, NoSQL databases are used only in specific areas and mostly do not handle complex processing, but they compensate for the weaknesses of relational databases listed above.

Advantages:

Data is easy to distribute

The relationships between data are where the relational database gets its name. To process joins, a relational database has to keep the data on the same server, which works against distributing the data; this is why relational databases are not good at writing large volumes of data. In contrast, NoSQL databases do not support joins and each piece of data is designed to be independent, so it is easy to spread the data across multiple servers. This reduces the amount of data on each server, which makes large volumes of writes easier to handle, and of course reads become easier as well.
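As a rough illustration of why independent records are easy to distribute, the sketch below assigns each key to a server by hashing it. The server names, the `server_for` function, and the `ShardedStore` class are hypothetical, and real systems use more elaborate schemes (for example consistent hashing or range partitioning).

```python
import hashlib

SERVERS = ["node-a", "node-b", "node-c"]  # hypothetical server names


def server_for(key, servers=SERVERS):
    """Pick a server from a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


class ShardedStore:
    """Toy key-value store that spreads keys over several 'servers' (dicts)."""

    def __init__(self, servers=SERVERS):
        self.servers = list(servers)
        self.data = {name: {} for name in self.servers}

    def put(self, key, value):
        self.data[server_for(key, self.servers)][key] = value

    def get(self, key):
        return self.data[server_for(key, self.servers)].get(key)
```

Because no record depends on another, every read or write touches exactly one server, and adding servers spreads both the data and the write load.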

Typical NoSQL databases

Temporary key-value stores (memcached, Redis), persistent key-value stores (ROMA, Redis), document-oriented databases (MongoDB, CouchDB), and column-oriented databases (Cassandra, HBase)

I. Key-value stores

Data is stored as key-value pairs. Access is very fast, but data can basically only be retrieved by an exact match on the key. Depending on how the data is kept, key-value stores fall into three types: temporary, permanent, and both.

(1) Temporary

"Temporary" means the data may be lost. memcached keeps all data in memory, so saving and reading are very fast, but when memcached stops, the data is gone. Because the data lives only in memory, data beyond the memory capacity cannot be kept, and old data is discarded. In summary:

- Data is kept in memory

- Save and read processing is very fast

- Data may be lost
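The behaviour described above (everything in memory, old data dropped once capacity is exceeded) can be sketched as a small cache with least-recently-used eviction. This is an illustrative model only: the `VolatileCache` class and its item-count limit are assumptions, not memcached's actual implementation, which limits memory by bytes.

```python
from collections import OrderedDict


class VolatileCache:
    """In-memory key-value cache that discards the least recently used
    entry once it holds more than max_items entries."""

    def __init__(self, max_items):
        self.max_items = max_items
        self._items = OrderedDict()

    def set(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        while len(self._items) > self.max_items:
            self._items.popitem(last=False)  # evict the oldest entry

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as recently used
        return self._items[key]
```

Nothing is ever written to disk, so restarting the process (or exceeding the capacity) loses data, which is the trade-off for speed.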

(2) Permanent

"Permanent" means the data is not lost. Here the key-value store saves data on the hard disk. Compared with the temporary type, disk I/O is unavoidable, so there is a performance gap, but the fact that data is not lost is its greatest advantage. In summary:

- Data is saved on the hard disk

- Save and read processing is very fast (though not comparable to memcached)

- Data is not lost

(3) Both

Redis belongs to this type. Redis is special in that it is both temporary and permanent. It first stores data in memory and writes it to the hard disk when certain conditions are met (by default, when at least 1 key has changed within 15 minutes, at least 10 keys within 5 minutes, or at least 10,000 keys within 1 minute). This keeps the processing speed of in-memory data, while the write to disk makes the data durable. This type of database is particularly suitable for array-type data. In summary:

- Data is saved both in memory and on the hard disk

- Save and read processing is very fast

- Data saved on the hard disk does not disappear (it can be restored)

- Suitable for handling array-type data
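The snapshot conditions can be expressed as a simple rule check. The sketch below assumes the classic `redis.conf` save points (`save 900 1`, `save 300 10`, `save 60 10000`); newer Redis versions ship different defaults, and the `should_snapshot` function is a hypothetical helper, not part of Redis.

```python
# (seconds, minimum number of changed keys); assumed classic defaults
SAVE_POINTS = [(900, 1), (300, 10), (60, 10000)]


def should_snapshot(seconds_since_last_save, changes_since_last_save,
                    save_points=SAVE_POINTS):
    """Return True if any save point is satisfied: enough time has passed
    AND enough keys have changed since the last snapshot."""
    return any(
        seconds_since_last_save >= seconds and changes_since_last_save >= changes
        for seconds, changes in save_points
    )
```

A busy instance therefore snapshots often (the 1-minute rule), while a quiet one still persists a single change within 15 minutes.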

II. Document-oriented databases

MongoDB and CouchDB belong to this type. They are also NoSQL databases, but they differ from key-value stores.

(1) No table structure needs to be defined

Even without defining a table structure, you can use the database as if one had been defined, which saves the trouble of altering the table structure.

(2) Complex query conditions can be used

Unlike a key-value store, a document-oriented database can fetch data through complex query conditions. It does not have the transaction and join capabilities of a relational database, but apart from those, almost all other processing is possible.
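Both properties can be illustrated with plain Python dictionaries standing in for documents. This is only a model of the idea: the sample documents and the `find` helper are invented here and do not use the MongoDB or CouchDB APIs.

```python
# Documents in one collection need not share the same fields.
documents = [
    {"_id": 1, "name": "alice", "age": 31, "tags": ["admin"]},
    {"_id": 2, "name": "bob", "age": 24},
    {"_id": 3, "name": "carol", "age": 42, "city": "Paris", "tags": ["ops", "admin"]},
]


def find(docs, predicate):
    """Return the documents for which predicate(doc) is true."""
    return [doc for doc in docs if predicate(doc)]


# A compound condition on fields other than the key.
admins_over_30 = find(
    documents,
    lambda d: d.get("age", 0) > 30 and "admin" in d.get("tags", []),
)
```

The second document has no `tags` field and the third has an extra `city` field, yet no schema change was needed, and the query filters on non-key fields.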

III. Column-oriented databases

Cassandra, HBase, and Hypertable are of this type. Because of the explosive growth in data volumes in recent years, this type of NoSQL database has attracted particular attention.

An ordinary relational database stores data row by row and is good at reading in units of rows, for example fetching the records that match specific conditions. For this reason a relational database is also called a row-oriented database. In contrast, a column-oriented database stores data column by column and is good at reading data in units of columns.
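The difference between the two layouts can be shown with the same three records arranged both ways. The sample data and the `to_columns` helper are made up for illustration.

```python
# Row-oriented layout: each record's fields are stored together.
rows = [
    (1, "alice", 31),
    (2, "bob", 24),
    (3, "carol", 42),
]
COLUMNS = ("id", "name", "age")


def to_columns(rows, column_names):
    """Transpose row tuples into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(column_names)}


# Column-oriented layout: each column's values are stored together.
columns = to_columns(rows, COLUMNS)

# Row layout: fetching one whole record is a single lookup.
second_record = rows[1]

# Column layout: an aggregate over one column touches only that column.
average_age = sum(columns["age"]) / len(columns["age"])
```

Fetching a whole record is natural in the row layout, while the average only needs the `age` list in the column layout and never looks at names or ids.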

Column-oriented databases are highly scalable: even as the data grows, processing speed (especially write speed) does not degrade, so they are mainly used where large amounts of data need to be processed. They are also useful as storage for batch programs that update large amounts of data. However, because the way of thinking behind a column-oriented database is very different from that of today's mainstream database storage, it is hard to apply.

Summary: relational databases and NoSQL databases are not opposed but complementary. Use a relational database in the general case, and use a NoSQL database where NoSQL fits, so that the NoSQL database makes up for the weaknesses of the relational database.

HBase vs. Oracle (column-oriented vs. row-oriented databases)

1 Main differences

1.1 HBase suits workloads with a large volume of inserts together with reads

1.2 The bottleneck of HBase is hard disk transfer speed; the bottleneck of Oracle is hard disk seek time.

HBase essentially has only one operation, insert. An update inserts a row with a new timestamp, and a delete inserts a row carrying a delete marker. Its main mode of operation is to collect a batch of data in memory and then write it to disk in bulk, so its write speed depends mainly on the disk's transfer speed. Oracle is different: it often reads and writes randomly, so the disk head has to keep seeking for data, and the bottleneck is disk seek time.

1.3 HBase is well suited to finding the top N records sorted by time

1.4 Different indexes lead to different behavior.

1.5 Oracle can handle both OLTP and OLAP, but under some extreme conditions (very heavy load) it is not suitable.
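The insert-only model described under 1.2 can be sketched as an append-only log in which updates and deletes are just new entries. The `AppendOnlyTable` class, the logical clock, and the `TOMBSTONE` sentinel are assumptions for illustration and do not mirror HBase's real classes or on-disk format.

```python
import itertools

TOMBSTONE = object()  # stands in for a delete marker


class AppendOnlyTable:
    """Every put, update, and delete appends a new timestamped entry;
    nothing is modified in place."""

    def __init__(self):
        self._log = []  # entries: (row_key, column, timestamp, value)
        self._clock = itertools.count(1)  # logical timestamps

    def put(self, row_key, column, value):
        self._log.append((row_key, column, next(self._clock), value))

    def delete(self, row_key, column):
        # A delete is an insert whose value is the delete marker.
        self.put(row_key, column, TOMBSTONE)

    def get(self, row_key, column):
        latest = None
        for key, col, ts, value in self._log:
            if key == row_key and col == column and (latest is None or ts > latest[0]):
                latest = (ts, value)
        if latest is None or latest[1] is TOMBSTONE:
            return None
        return latest[1]
```

Since writes only ever append, they can be buffered and flushed sequentially, which is why transfer speed rather than seek time dominates.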

2 Limitations of HBase

1. It can only do simple key-value queries; complex SQL statistics are not supported.

2. Fast queries are possible only on the row key.

3 Row-oriented storage in traditional databases

In data analysis we often use one column as the query condition, and the results often contain only some columns rather than all of them. In this case the I/O performance of a row-oriented database is poor. Take Oracle as an example: a large data file is divided into many blocks, and rows are placed in each block one after another, packed together until the block is full (some space is reserved for future updates). The disadvantage of this structure is that a column cannot be read on its own. When only one column is needed (the column highlighted in red in the original figure), the whole block has to be read into memory and the column's data extracted from it. In other words, to read some columns of a table, the full rows containing those columns must be read first. If those columns are a very small part of the data, say 100 MB out of 1 TB, then reading 1 TB into memory to get 100 MB is clearly not cost-effective.

3.1 B+ tree index

The data access technique used in Oracle is primarily the B-tree index:

Starting from the root node and following the branch nodes, you reach a leaf node, which records the location of the row corresponding to the key value.

Operations on a B-tree:

B-tree insertion: split a node

B-tree deletion: merge nodes
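The two operations can be sketched at the level of a single leaf. The functions below are simplified illustrations with an assumed leaf capacity of 4: a real B+ tree also updates parent nodes, and on deletion it may borrow keys from a sibling instead of merging, which this sketch leaves out.

```python
import bisect


def insert_into_leaf(leaf, key, capacity=4):
    """Insert key into a sorted leaf and split the leaf if it overflows.

    Returns (leaves, separator). Without a split, leaves has one element
    and separator is None. With a split, leaves has two elements and
    separator is the first key of the right leaf, which would be copied
    into the parent node.
    """
    keys = list(leaf)
    bisect.insort(keys, key)
    if len(keys) <= capacity:
        return [keys], None
    mid = len(keys) // 2
    left, right = keys[:mid], keys[mid:]
    return [left, right], right[0]


def delete_and_maybe_merge(leaf, right_sibling, key, capacity=4):
    """Delete key from leaf; merge with the right sibling if the leaf
    becomes less than half full and both fit into one leaf."""
    keys = [k for k in leaf if k != key]
    if len(keys) >= capacity // 2 or len(keys) + len(right_sibling) > capacity:
        return [keys, list(right_sibling)]
    return [keys + list(right_sibling)]
```

Both operations rewrite nodes in place, which on disk means random reads and writes, in line with the seek-time bottleneck described for Oracle.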

4 Column-oriented storage

The data of one column is packed together, for example within the same blocks. When a column needs to be read, only the relevant files or blocks are read into memory and the whole column is read out, so there is much less I/O.

Data in the same column has a similar format, so it compresses very well. This saves both storage space and I/O, because compressed data means less data to read.
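One simple way to see why a column compresses well is run-length encoding, where repeated neighbouring values are stored once with a count. This is only an example technique chosen for illustration; the source does not say which compression scheme is used, and the sample column is invented.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse runs of equal neighbouring values into (value, count) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


def rle_decode(runs):
    """Expand (value, count) pairs back into the original list."""
    return [value for value, count in runs for _ in range(count)]


country_column = ["CN", "CN", "CN", "CN", "US", "US", "DE", "DE", "DE"]
encoded = rle_encode(country_column)
```

Nine values shrink to three pairs here. In a row layout the same values would be interleaved with other fields, so such runs would not be adjacent.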

A row-oriented database is suitable for OLTP, whereas a column-oriented database is not.

4.1 BigTable's LSM (Log-Structured Merge) index

In HBase the log is the data and the data is the log; the two are one and the same. Why? Because an update in HBase inserts a row, and a delete also inserts a row that carries a delete marker. Is that not exactly a log?

HBase has the MemStore (memory store) and StoreFiles (store files). Each MemStore and each StoreFile is in effect a B+ tree attached to a column family (somewhat like an Oracle index-organized table, where data and index are integrated). That is, the column family sits underneath and the B+ tree on top. A query first searches the B+ tree in the MemStore and, if the data is not found there, goes on to search the StoreFiles.

If the data for a row is scattered across several column families, how is the row found? Several B+ trees have to be searched, which is less efficient. So try to keep each inserted row sparse across column families: only one column family has values and the others have none.
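The read path described above (memory store first, then store files) can be sketched as follows. The `ColumnFamilyStore` class and its tiny flush threshold are assumptions for illustration; sorted Python lists searched with `bisect` stand in for the sorted on-disk structure, and compaction and delete markers are omitted.

```python
import bisect


class ColumnFamilyStore:
    """Toy LSM store for one column family: an in-memory buffer plus
    immutable sorted files produced by flushing that buffer."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.memstore = {}
        self.store_files = []  # oldest first; each is a sorted list of (key, value)

    def put(self, key, value):
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the buffered data out as one sorted, immutable file."""
        if self.memstore:
            self.store_files.append(sorted(self.memstore.items()))
            self.memstore = {}

    def get(self, key):
        # 1. Look in memory first.
        if key in self.memstore:
            return self.memstore[key]
        # 2. Then search the store files, newest first.
        for store_file in reversed(self.store_files):
            index = bisect.bisect_left(store_file, (key,))
            if index < len(store_file) and store_file[index][0] == key:
                return store_file[index][1]
        return None
```

A row spread over several column families would need one such lookup per family, which is the inefficiency the text advises avoiding.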


1. Different indexes lead to different behavior
HBase can only build an index on the primary key (the row key), and later queries can only be simple key-value lookups based on that index.
Oracle, however, can create arbitrary indexes and can query data by any column.

2. HBase suits workloads with many inserts together with reads, where the reads are generally key-value lookups
Big data and high concurrency are exactly what HBase is built for.

3. The bottleneck of HBase is hard disk transfer speed; the bottleneck of Oracle is hard disk seek time
HBase writes large amounts of data to disk (there is no delete or update, only insert), and even reads check the MemStore first, so disk transfer speed becomes its bottleneck.
Oracle has random access patterns (SELECT, UPDATE, and so on), so disk seek time becomes its bottleneck, and seek time is mainly determined by the rotational speed.

4. HBase is well suited to finding the top N records sorted by time
This is because HBase data carries a timestamp (HBase attaches one by default).
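A minimal sketch of why timestamps help with "top N by time": if the versions of a value are kept sorted newest first, the latest N are simply the first N entries. The `VersionedCell` class is a hypothetical model, not HBase's API, and it assumes timestamps are plain integers.

```python
import bisect


class VersionedCell:
    """Keeps every version of a value ordered from newest to oldest."""

    def __init__(self):
        # Stored as (-timestamp, value) so ascending order is newest first.
        self._versions = []

    def put(self, timestamp, value):
        bisect.insort(self._versions, (-timestamp, value))

    def latest(self, n):
        """Return the n most recent (timestamp, value) pairs."""
        return [(-neg_ts, value) for neg_ts, value in self._versions[:n]]
```

Reading the newest entries needs no sort at query time, because the ordering is maintained as data is written.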

Row-oriented storage:
Data is stored in data files.
Basic unit of a data file: the block/page (rows are stored one after another in a block; the block is not filled completely, and space is reserved for operations on the rows, such as updates).
Block structure: size, data area.
To select a single column (highlighted in red-orange in the original figure), a row-oriented database loads the entire block into memory and then filters out the required column.
In HBase, because of how data is stored, data is organized by column family, and a file block stores data from a single column family.
As a result, such queries are much more efficient than in a row-oriented database.

In addition, because data in the same column in HBase has a similar format or similar length, it can be compressed substantially,
which saves disk space and also reduces I/O.
