Comparison of HBase and Oracle


Reposted from: http://www.cnblogs.com/chay1227/archive/2013/03/17/2964020.html

Reposted from: http://blog.csdn.net/allen879/article/details/40461227

Reposted from: http://blog.itpub.net/28912557/viewspace-776770/


Because of project requirements, an upgrade of our existing system had to use HBase, and after using it I found it really is very good. So the question is: why use HBase here instead of Oracle, the relational database we used before? What are the characteristics of each, and how do their use cases differ? It is best to learn with questions in mind.


First, a comparison of relational databases and NoSQL:

A relational database represents all data as two-dimensional tables of rows and columns.

Advantages of relational databases:

1. Maintain data consistency (transaction processing)

2. Because the data is normalized, updates are cheap (each field is basically stored in only one place)

3. Supports joins and other complex queries

The ability to maintain data consistency is the biggest advantage of relational databases.

Weaknesses of relational databases:

They are not good at the following:

1. Write processing of large amounts of data

2. Index or table structure (schema) changes on tables that are being updated

3. Applications where the fields are not fixed

4. Simple queries that need results returned quickly

--Write processing of large amounts of data

When reads and writes are concentrated on a single database, the database is overwhelmed. Most sites therefore use master-slave replication to separate reads from writes, which improves read/write performance and makes the read databases scalable.

For large volumes of data operations, then, a master-slave setup is used: the master database handles writes and the slave databases handle reads. Read capacity can be scaled fairly simply by adding slave databases, but there is no equally simple way to scale writes.

First, to scale writes, consider increasing the number of master databases from one to two, in a dual-master configuration where the two replicate to each other. This can halve the load on each master, but updates may conflict and leave the data inconsistent. To avoid this, requests for each table are assigned to a designated master, so that each table is processed by only one of them.
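The table-based routing described above can be sketched roughly as follows. This is only an illustration, not a real database driver: the server names, the `TABLE_TO_PRIMARY` mapping, and the `route` function are all hypothetical.

```python
import itertools

# Hypothetical topology: two primaries, each owning a disjoint set of tables,
# and each with its own read replicas.
TABLE_TO_PRIMARY = {"users": "primary-a", "orders": "primary-b"}
REPLICAS = {
    "primary-a": ["replica-a1", "replica-a2"],
    "primary-b": ["replica-b1"],
}
# Round-robin iterators so reads are spread across a primary's replicas.
_replica_cycles = {p: itertools.cycle(r) for p, r in REPLICAS.items()}


def route(table: str, is_write: bool) -> str:
    """Return the server that should handle a request for `table`."""
    primary = TABLE_TO_PRIMARY[table]
    if is_write:
        # All writes for one table go to one primary, so the two primaries
        # never update the same table and cannot conflict.
        return primary
    return next(_replica_cycles[primary])
```

Adding a replica to `REPLICAS` scales reads, while write capacity for a given table is still limited to its single primary, which is the asymmetry the text describes.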

Second, consider splitting the database, placing different tables on different database servers. Splitting reduces the amount of data on each server, which reduces disk I/O and allows more processing to be done in memory at high speed. However, tables stored on different servers cannot be joined, so this has to be considered in advance when splitting. If a join is unavoidable after splitting, it has to be done in the application, which is difficult.
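To make "joining in the program" concrete, here is a minimal sketch. The two in-memory structures stand in for tables on two different servers; the table contents and the function name are invented for illustration.

```python
# Two in-memory structures stand in for tables on two different database servers.
users_server = {1: {"name": "alice"}, 2: {"name": "bob"}}
orders_server = [
    {"order_id": 100, "user_id": 1, "amount": 30},
    {"order_id": 101, "user_id": 2, "amount": 45},
    {"order_id": 102, "user_id": 1, "amount": 12},
]


def orders_with_user_names():
    """Join orders to users in application code, matching on user_id."""
    result = []
    for order in orders_server:                     # rows fetched from one server
        user = users_server.get(order["user_id"])   # lookup against the other server
        if user is not None:                        # inner-join semantics
            result.append({**order, "name": user["name"]})
    return result
```

Even in this toy form the application has to handle missing matches itself, and in a real system each lookup would be a network round trip, which is part of why the text calls this difficult.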

--Index or table structure changes on tables that are being updated

In a relational database, an index must be created to speed up queries, and the table structure must be changed to add a field. To carry out these operations, a shared lock is placed on the table, and while it is held data changes (update, insert, delete) cannot be performed. For time-consuming operations, such as indexing a table with a large amount of data or changing its structure, pay special attention to the fact that the data may not be updatable for a long time.

--Applications where the fields are not fixed

If the fields are not fixed, a relational database is also hard to use. Some will say that you can simply add a field when you need one. That is not impossible, but in practice changing the table structure again and again is very painful. You can also define a large number of spare fields in advance, but then it is easy to lose track of which field holds which data.

--Simple queries that need results returned quickly ("simple" here means no complex query conditions)

This is not exactly a shortcoming, but relational databases are not good at returning results quickly for simple queries. Data is read through a dedicated language, SQL, so each query has to be parsed, and there is additional overhead such as locking and unlocking tables. This is not to say that relational databases are too slow, only that a relational database is not necessarily the right choice when you want to process simple queries at high speed.

---------------------------

NoSQL Database

Relational databases are widely used and can handle complex processing such as transactions and joins. NoSQL databases, by contrast, are used only in specific domains and basically do no complex processing, but they make up for exactly the relational-database weaknesses listed above.

Advantages:

Easy to distribute data

The relationships between pieces of data are what give relational databases their name. To perform joins, a relational database has to keep the data on the same server, which makes the data hard to distribute. This is why relational databases are not good at writing large amounts of data. NoSQL databases, by contrast, were never designed to support joins. Each piece of data is designed to be independent, so it is easy to distribute the data across multiple servers. That reduces the amount of data on each server, so handling a large volume of writes becomes easier, and reads of course become easier too.
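As a rough sketch of why independent keys are easy to distribute, the example below assigns each key to a server by hashing it. The server names and the simple modulo scheme are assumptions made for illustration; real systems typically use more elaborate schemes (for example consistent hashing) so that adding a server moves less data.

```python
import hashlib

SERVERS = ["node-0", "node-1", "node-2"]  # hypothetical server names

# Each node holds only its own share of the data.
storage = {name: {} for name in SERVERS}


def server_for(key: str) -> str:
    """Pick a server from a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return SERVERS[int.from_bytes(digest[:8], "big") % len(SERVERS)]


def put(key: str, value) -> None:
    storage[server_for(key)][key] = value


def get(key: str):
    return storage[server_for(key)].get(key)
```

Because no operation ever needs two keys at once, a write touches exactly one server, and write load spreads as servers are added.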

Typical NoSQL databases

Temporary key-value stores (memcached, Redis), persistent key-value stores (ROMA, Redis), document-oriented databases (MongoDB, CouchDB), column-oriented databases (Cassandra, HBase)

1. Key-value stores

Data is stored as key-value pairs. This is very fast, but data can basically be retrieved only by an exact match on the key. By how the data is stored, these stores fall into three kinds: temporary, permanent, and both.

(1) Temporary

Temporary means the data may be lost. memcached keeps all data in memory, so saving and reading are very fast, but when memcached stops, the data no longer exists. Because the data lives only in memory, data beyond the memory capacity cannot be held, and old data is discarded. In summary:

- Data is saved in memory

- Saving and reading are very fast

- Data may be lost
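The behavior summarized above (memory only, fast, old data dropped when capacity is exceeded) can be imitated in a few lines. This is a toy model, not memcached itself: the `TemporaryStore` class is hypothetical, and it assumes a least-recently-used eviction policy and counts items rather than bytes.

```python
from collections import OrderedDict


class TemporaryStore:
    """In-memory key-value store that evicts the least recently used key."""

    def __init__(self, max_items: int):
        self.max_items = max_items
        self._data = OrderedDict()

    def set(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)          # mark as most recently used
        if len(self._data) > self.max_items:
            self._data.popitem(last=False)   # over capacity: drop the oldest entry

    def get(self, key):
        if key not in self._data:
            return None                      # never stored, or already evicted
        self._data.move_to_end(key)
        return self._data[key]
```

Nothing here is written to disk, so discarding the object (like stopping the server) loses everything, which is the trade-off the text describes.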

(2) Permanent

Permanent means the data will not be lost. Here the key-value store keeps its data on disk. Compared with the temporary kind, disk I/O is unavoidable, so there is still a gap in performance, but the fact that data is not lost is its biggest advantage. In summary:

- Data is saved on disk

- Saving and reading are very fast (but slower than memcached)

- Data is not lost

(3) Both

Redis belongs to this type. Redis is special in that it is both temporary and permanent. It first keeps the data in memory and writes it to disk when certain conditions are met (by default: at least 1 key changed in 15 minutes, at least 10 keys changed in 5 minutes, or at least 10,000 keys changed in 1 minute). This keeps the processing speed of in-memory data, while the writes to disk make the data permanent. This type of database is particularly suitable for handling array-type data. In summary:

- Data is saved both in memory and on disk

- Saving and reading are very fast

- Data saved on disk does not disappear (it can be recovered)

- Suitable for handling array-type data
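The "write to disk when certain conditions are met" rule can be sketched as a small predicate. This sketch assumes each condition is a pair of the form "at least N seconds elapsed and at least M keys changed", using the values 900/1, 300/10 and 60/10000 as its reading of the defaults mentioned above. The names `SAVE_RULES` and `should_snapshot` are hypothetical and are not part of Redis.

```python
# (seconds, min_changes) pairs: the sketch's reading of the defaults in the text.
SAVE_RULES = [(900, 1), (300, 10), (60, 10000)]


def should_snapshot(seconds_since_last_save: float,
                    changes_since_last_save: int,
                    rules=SAVE_RULES) -> bool:
    """True if any rule holds: enough time has passed AND enough keys changed."""
    return any(
        seconds_since_last_save >= seconds and changes_since_last_save >= changes
        for seconds, changes in rules
    )
```

For example, `should_snapshot(300, 10)` is true, while `should_snapshot(59, 1_000_000)` is false because no rule's time threshold has been reached yet. Busy periods are therefore saved more often than quiet ones.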

2. Document-oriented databases

MongoDB and CouchDB belong to this type. They are NoSQL databases, but they differ from key-value stores.

(1) No table structure needs to be defined

Even without defining a table structure, you can use the database as if one had been defined, which eliminates the hassle of altering the table structure.

(2) Complex query conditions can be used

Unlike key-value stores, document-oriented databases can retrieve data through complex query conditions. They do not have the transaction processing and joins of relational databases, but most other processing is achievable.
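The two properties above can be illustrated with plain Python dictionaries standing in for documents. This is not the API of MongoDB or CouchDB; the `collection` data and the `find` helper are made up for the example.

```python
# Documents in one collection need not share the same fields.
collection = [
    {"_id": 1, "name": "alice", "age": 31, "tags": ["admin"]},
    {"_id": 2, "name": "bob", "email": "bob@example.com"},
    {"_id": 3, "name": "carol", "age": 25, "tags": ["dev", "ops"]},
]


def find(docs, predicate):
    """Return the documents that satisfy an arbitrary condition on their fields."""
    return [doc for doc in docs if predicate(doc)]


# A condition on two fields, both of which are missing from some documents.
adult_devs = find(
    collection,
    lambda d: d.get("age", 0) >= 18 and "dev" in d.get("tags", []),
)
```

Adding a new field to one document requires no schema change, and the query is not limited to an exact match on a key.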

3. Column-oriented databases

Cassandra, HBase, and Hypertable belong to this type. Because of the explosive growth of data volumes in recent years, this type of NoSQL database deserves particular attention.

An ordinary relational database stores data in units of rows and is good at reading in units of rows, for example fetching the records that match a specific condition. A relational database is therefore also called a row-oriented database. A column-oriented database, by contrast, stores data in units of columns and is good at reading data column by column.

Column-oriented databases are highly scalable: even as the data grows, processing speed (especially write speed) does not drop, so they are mainly used where large amounts of data need to be processed. They are also very useful as the storage behind batch programs that update large amounts of data. However, because column-oriented databases require a different way of thinking from conventional databases, they are difficult to apply.

Conclusion: relational databases and NoSQL databases are complementary rather than opposed. Use a relational database under normal circumstances, use a NoSQL database where NoSQL fits, and let NoSQL make up for the shortcomings of the relational database.


HBase and Oracle compared (column-oriented and row-oriented databases)

1 Main differences

1.1 HBase suits workloads with a large number of inserts and concurrent reads.

1.2 HBase's bottleneck is disk transfer speed; Oracle's bottleneck is disk seek time.

HBase essentially has only one operation, insert. An update inserts a row with a new timestamp, and a delete inserts a row carrying a delete marker. Its main mode of operation is to collect a batch of data in memory and then write it to disk in bulk, so its write speed depends on the disk's transfer speed. Oracle is different: it often reads and writes randomly, so the disk head has to keep seeking for the data, and the bottleneck is disk seek time.
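The "everything is an insert" model can be sketched as an append-only list of versioned cells. This is a simplified illustration rather than HBase's actual storage code: the counter used as a timestamp source, the `TOMBSTONE` marker object, and the function names are all assumptions.

```python
import itertools

_clock = itertools.count(1)   # stand-in for a timestamp source
TOMBSTONE = object()          # delete marker
cells = []                    # append-only list of (row_key, timestamp, value)


def put_cell(row_key, value):
    # Both a first write and an update append a new timestamped version.
    cells.append((row_key, next(_clock), value))


def delete_cell(row_key):
    # A delete is just another insert, carrying a marker instead of a value.
    cells.append((row_key, next(_clock), TOMBSTONE))


def get_cell(row_key):
    """Return the newest version, or None if the row is missing or deleted."""
    versions = [(ts, value) for key, ts, value in cells if key == row_key]
    if not versions:
        return None
    _, newest = max(versions, key=lambda pair: pair[0])
    return None if newest is TOMBSTONE else newest
```

Because nothing is ever modified in place, writes are purely sequential appends, which fits the text's point that throughput is bounded by transfer speed rather than seek time.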

1.3 HBase is well suited to finding the top N records in time order.

1.4 Differences in behavior caused by different indexes.

1.5 Oracle can do both OLTP and OLAP, but in some extreme cases (very heavy load) it is not suitable.
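For the top-N-by-time case in 1.3, one common approach in a store that keeps rows sorted by key is to build the key from an inverted timestamp, so that the newest rows sort first. The sketch below assumes that design; the source does not say how the row key is built, and the sorted Python list merely stands in for a key-ordered table.

```python
import bisect

MAX_TS = 2**63 - 1   # largest signed 64-bit value, used to invert timestamps
table = []           # sorted list of (row_key, value), like a key-ordered store


def row_key_for(timestamp: int) -> str:
    # Zero-padded to 19 digits so that string order matches numeric order.
    return f"{MAX_TS - timestamp:019d}"


def insert(timestamp: int, value) -> None:
    bisect.insort(table, (row_key_for(timestamp), value))


def latest(n: int):
    """The newest n entries are simply the first n rows of a scan."""
    return [value for _, value in table[:n]]
```

With this key layout, "latest N" needs no sorting at query time: it is a short scan from the start of the table.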


2 Limitations of HBase

1. It can only do simple key-value queries; it cannot do complex SQL-style statistics.

2. Fast queries are possible only on the row key.

3 Row storage in traditional databases

In data analysis scenarios, we often use one column as the query condition, and the results returned are often just a few columns rather than all of them. Row-oriented databases have poor I/O performance in this case. Take Oracle as an example: a large data file is divided into blocks, and rows are packed into each block one after another until the block is full, with some space reserved for future updates. The drawback of this structure shows when we only need some columns: instead of reading just that part of the data, the whole block has to be read into memory and the columns extracted from it. In other words, to read some columns of a table, every column of every row has to be read first. If those columns hold very little data, for example only 100 MB out of 1 TB, reading 1 TB into memory to get at 100 MB is clearly not cost-effective.

3.1 B+ tree index

The main data access technique used in Oracle is the B-tree index:

Starting from the root of the tree, you can reach a leaf node, which records the location of the row that the key value corresponds to.

Operations on the B-tree:

B-tree insertion: splitting nodes

B-tree deletion: merging nodes

4 Column storage


Data from the same column is packed together, for example into the same blocks. When I need to read a column, only the relevant files or blocks have to be read into memory to get the entire column, so there is much less I/O.
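The difference in how much data has to be touched can be shown with a tiny example. The table contents are invented, and counting "values read" is only a stand-in for block-level I/O.

```python
rows = [
    {"id": 1, "name": "alice", "city": "Paris", "amount": 30},
    {"id": 2, "name": "bob", "city": "Rome", "amount": 45},
    {"id": 3, "name": "carol", "city": "Paris", "amount": 12},
]

# Column layout: one list per column, values kept in row order.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def sum_amount_row_store():
    """Row layout: every field of every row is read to reach one column."""
    values_read = sum(len(row) for row in rows)
    return sum(row["amount"] for row in rows), values_read


def sum_amount_column_store():
    """Column layout: only the 'amount' column is read."""
    amounts = columns["amount"]
    return sum(amounts), len(amounts)
```

Both functions return the same total, but the row layout touches 12 values and the column layout only 3. The gap grows with the number of columns in the table.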

Data in the same column is similar in format, so it compresses very well. This saves storage space and also saves I/O, because compressed data means less data to read.
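As a small illustration of why similar values in one column compress well, the sketch below uses run-length encoding. This is just one simple technique chosen for the example; the source does not say which compression scheme a given column store uses.

```python
from itertools import groupby


def rle_encode(values):
    """Run-length encode a list as (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original list."""
    return [value for value, count in pairs for _ in range(count)]


# A column stored together often contains long runs of the same value.
city_column = ["Paris"] * 4 + ["Rome"] * 3 + ["Oslo"] * 2
encoded = rle_encode(city_column)
```

Here nine stored values shrink to three pairs. In a row layout the same city values would be interleaved with names, ids and amounts, so such runs would not appear.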

Row-oriented databases are suitable for OLTP; column-oriented databases are not.

4.1 BigTable's LSM (Log-Structured Merge) index

In HBase the log is the data and the data is the log; they are one and the same. Why? Because an HBase update inserts a row, and a delete also inserts a row, this time carrying a delete marker. Is that not exactly a log?

HBase has the MemStore (in memory) and StoreFiles (on disk). In effect, each MemStore and each StoreFile carries a B+ tree for its column family (a bit like an Oracle index-organized table, where data and index are one). That is, under each column family there is a B+ tree. When data is queried, the B+ tree of the in-memory MemStore is searched first; if nothing is found there, the StoreFiles are searched.
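The read path described above (memory first, then files) can be sketched as follows. This is a toy model under simplifying assumptions: plain dictionaries replace the per-column-family tree indexes, the flush threshold counts keys rather than bytes, and the `ColumnFamilyStore` class is hypothetical.

```python
class ColumnFamilyStore:
    """Toy LSM store: one in-memory buffer plus immutable flushed files."""

    def __init__(self, flush_threshold: int = 2):
        self.flush_threshold = flush_threshold
        self.memstore = {}       # recent writes, held in memory
        self.store_files = []    # flushed snapshots, oldest first

    def put(self, key, value):
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the whole buffer out as one sorted, immutable file.
        self.store_files.append(dict(sorted(self.memstore.items())))
        self.memstore = {}

    def get(self, key):
        if key in self.memstore:                        # 1. memory first
            return self.memstore[key]
        for store_file in reversed(self.store_files):   # 2. then files, newest first
            if key in store_file:
                return store_file[key]
        return None
```

Searching the newest file first matters because an older file may still hold a stale version of the same key. A row spread over several column families would need this lookup once per family, which is the cost discussed next.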

If the data for one row is scattered across several column families, how is that row's data found? Several B+ trees have to be searched, which is less efficient. So try to keep each inserted row sparse across column families: only one column family has values and the others have none.

1. Differences in behavior caused by different indexes
HBase can only build a primary key (row key) index, so queries can only be simple key-value lookups based on that index.
Oracle, by contrast, can build indexes on arbitrary columns and query data by any column.
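The practical consequence of having only a row-key index can be sketched like this. The table contents and function names are invented, and a Python dict stands in for the row-key index.

```python
table = {
    "row-001": {"name": "alice", "city": "Paris"},
    "row-002": {"name": "bob", "city": "Rome"},
    "row-003": {"name": "carol", "city": "Paris"},
}


def get_by_row_key(row_key):
    """Indexed access: one direct lookup on the row key."""
    return table.get(row_key)


def scan_by_column(column, value):
    """No index on this column, so every row has to be examined."""
    rows_examined = 0
    matches = []
    for row_key, row in table.items():
        rows_examined += 1
        if row.get(column) == value:
            matches.append(row_key)
    return matches, rows_examined
```

A lookup by row key touches one entry, while a query on `city` examines all three rows. In a database that allows an index on `city`, the second query could also avoid the full scan.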

2. HBase suits workloads with a large number of inserts and concurrent reads, where the reads are generally key-value lookups
Large data volumes and high concurrency are exactly what HBase handles well.

3. HBase's bottleneck is disk transfer speed; Oracle's bottleneck is disk seek time
HBase writes data to disk in large batches (there are no deletes or updates, only inserts), and even reads go to the MemStore first, so disk transfer speed becomes its bottleneck.
