Thinking about the search/query of sql/nosql database

Last Update:2014-07-22 Source: Internet

Author: User

Tags solr

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

Reprint Please specify source: Jiq's technical Blog

HBase Features:

Recently in the learning Hbase,hbase based on the row Jian is indexed, the query speed will be very fast, completely real-time.

But if HBase is to query based on fields other than row health, it can only be a full-fledged scan, which is basically unacceptable.

So HBase will generally be designed for specific scenarios to design the health, the use of the real-time query based on row health to achieve real-time HBase data query.

A relational database is based on a real-time query of indexed fields:

Then associate to the relational SQL database, they set up the b/b+/b-tree index for the primary key, the query based on the primary key is real-time, and the range scan is real-time.

More importantly, the relational database can be easily indexed for real-time queries based on other fields that are non-primary keys.

Two-level index of HBase:

So can hbase, a NoSQL database, also be indexed for fields other than row health, so that queries for those fields get real-time results?

The answer is yes.

In HBase, indexing a field other than a row is called a "level two index."

The usual way to build a two-level index is to take advantage of the "coprocessor" in HBase, which mainly contains the observer and endpoint two modes, which resemble the trigger trigger in a relational database, which resembles a stored procedure in a relational database. The user code can be embedded into the existing running process through observer, and the corresponding user code, called the callback function, will be triggered at a certain time. There are currently three types of observer interfaces:

1. Regionobserver: Provides hooks for data manipulation events, such as Get, Put, Delete, Scan, and so on. Each region will have a regionobserver instance.

2. Walobserver: Provides hooks for operation of pre-write log (Write-ahead Log,wal). Walobserver runs during the Wal process, and each region server has an instance.

3, Masterobserver: Provides the DDL operation Hook, this kind of operation has to create the table, deletes the table, modifies the table meta-information and so on. The masterobserver runs on the HBase master.

Each class of observer can load multiple, and all observer are executed in chained order.

HBase community There are three main scenarios for using the coprocessor framework to build a two-level index:

1, based on Walobserver in an index table to generate the index, through the column truncation write Ahead log write operation, extract the keyvalue to write Hlog information, the corresponding information stored in the Index table.

2, based on Regionobserver in a region to maintain an index column family, by intercepting the put, delete, and other operations, extract the corresponding information stored in the same region of the index column family, this way the index is a local index, does not support full ordering.

3, based on Regionobserver in an index table generated index, through the column section of the put, delete and other operations, extract the corresponding information stored in the Index table.

At this point, you may want to know what the index table looks like, in general, if one of the HBase tables has a column called the region, the store is the user's current city, such as Nanjing, Shanghai, so to create a corresponding two-level index, when inserting a new row of records, In the coprocessor hook function Observer will extract the value of this column, in the corresponding index table to insert a row of index records, the row index record Rowkey is the value extracted from the column, and the row index record Rowkey is the record just inserted the real rowkey, This is done in a retrieval region, such as a user entering a "Nanjing", or an hbase-based SQL statement: SELECT * from usertable where location= ' Nanjing ', will be detected location This column has been created a two-level index, so will be found in the index table Rowkey ' Nanjing ' record, and then take out the real rowkey, in the corresponding data table to take the corresponding record, display to the user.

There are many similarities between relational SQL database and NoSQL data indexing for non-primary keys.

Secondary index and full-text search:

But one thing to note is that two-level indexes and full-text indexing are two different concepts:

A two-level index is generally designed to be able to query in real time, and the purpose of the full-text search is to quickly find the content of interest in the database. The former is a way of exactly matching a column of values, and the latter is the need to hyphenate the data, creating an ' inverted index ' to the result of the word segmentation, which can support the search for the content (or the document/record) to which it belongs according to a particular keyword.

Full-Text Search in HBase:

So what does full-text retrieval in HBase generally do?

Prior exposure to SOLR, is a layer of packaged libraries above the lucence, supporting the construction of a distributed cluster approach to full-text retrieval services. The principle is simply to be a full-text retrieval of the contents of the field of Word segmentation, and then the results of the word segmentation to establish an inverted index.

HBase in order to achieve full-text retrieval, generally can be combined with LUCENCE/SOLR to do, the use of their auto-create INDEX function, for SOLR, is nothing more than its input source is different, previously probably for the relational database, the input source has now become hbase.

Full-Text search is not real-time, not that you insert a record into hbase, you can immediately in real time to query in a faster time, because the background to build the index is a process, in general, may be a few minutes or so.

Full-Text search in a relational database:

After learning about HBase's full-text search, it is natural to associate full-text retrieval in a relational database.

The full-text index of a relational database is almost identical to hbase.

The more popular way is to adopt and SOLR such a mature open-source full-text retrieval library implemented, the relational database as SOLR's data input source, they will periodically automatically extract data from the relational database, or by the relational database Insert/delete/update operation triggered, To create an inverted index in SOLR for the result of a specified field word breaker.

In addition, there are some more famous commercial full-text search engine, our company is currently using this.

A fuzzy query in a relational database like:

When thinking about queries, it is difficult to think of fuzzy queries in relational databases.

Personally, the fuzzy query seems to be similar to a full-text search for a field, in fact, in terms of function (not talking about efficiency) than this more powerful, such as you for "I am Jiyichin" This sentence, you may use the full-text search engine, participle will be divided into "I", "yes", "Jiyichin" Three words (the actual word breaker may not, I just give an example), this time you can only search the full text of the "I", "yes", "Jiyichin" three words, if you want to search "Yi Chin" then you will not be able to search.

But in the relational database of fuzzy query, if you use "%" to query the absolute can query.

Why is it?

I understand that the fuzzy query in the relational database does not establish the index of the whole field for the field of the fuzzy query, and then the index of the word segmentation results after the word segmentation, but after getting the query condition of "%", the whole table is scanned , and the regular expression is matched.

Although you put the wildcard in front, such as "like the righteous chin" so, this query will certainly not go index, but the full table scan, but if your fuzzy query is "like season%", the wildcard is placed in front, then this query is likely to go index, And the wildcard in front of the query actually has a few tricks to do the specific optimization, such as look at this example: http://blog.csdn.net/firstboy0513/article/details/6912632.

In most relational databases, it is possible to set up full-text indexing for the specified field (note that this is not a third-party full-text search engine), such as in MySQL:

MATCH (Col1,col2,...) Against (Expr[search_modifier])

Search_modifier: {in BOOLEAN MODE | Withquery EXPANSION}

For example: SELECT * from Tab_name WHERE MATCH (col1,col2) against (Search_word);

The table here needs to be of type MyISAM, col1, col2 need to be char, varchar, or text type, and a full-text index on col1 and col2 needs to be established before querying.

Here are some things to note:

MySQL in the case of high concurrent connections, the number of database records, SELECT ... WHERE ... The full-text search method of like '%...% ' is not only inefficient, but also with the wildcard character% and _ beginning as the query, the use of the index, the need for full table scan, the pressure on the database is very large. MySQL provides a full-text indexing solution for this issue, which not only improves performance and efficiency (because MySQL indexes these fields to optimize search), but also enables higher-quality searches. However, so far, MySQL does not correctly support Chinese full-text indexing.

---This problem can be solved by using some MySQL Chinese full-text search plugin.

Refer to this article for full-text search in Oracle: http://www.iteye.com/topic/1118055.

This article belongs to personal opinion, if there is doubt or correction, welcome to discuss, thank you!

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More