Reprint Please specify source: Jiq's technical Blog
HBase Features:
Recently in learning HBase. HBase is indexed based on row health, and queries are fast and in real time.
However, HBase is queried based on fields other than row health. Then it can only be a full-fledged scan, which is basically unacceptable.
So HBase will generally be designed for detailed application scenarios, using the real-time nature of the query based on row health to achieve real-time query of hbase data.
A relational database is based on a real-time query of indexed fields:
Then associate to the relational SQL database, they set up the b/b+/b-tree index for the primary key, the query based on the primary key is real-time. The range scan is also real-time.
More importantly, the relational database can also be easily indexed for real-time queries based on other fields that are non-primary keys.
Two-level index of HBase:
So a NoSQL database like HBase is not able to index fields other than row Jian, so that queries for these fields get real-time results?
The answer is yes.
In HBase, indexing a field other than a row is called a "level two index."
The usual way to build a two-level index is to take advantage of the "coprocessor" in HBase, which mainly includes the observer and endpoint two modes, which resemble the trigger trigger in a relational database, which resembles a stored procedure in a relational database. By observer the user code can be embedded in the existing execution process, at a specific time occurs when the corresponding user code, that is, the callback function.
There are now three types of observer interfaces:
1, Regionobserver: Provide the corresponding data manipulation event hooks, such operations have get, Put, Delete, Scan and so on.
Each region will have a regionobserver instance.
2. Walobserver: Provides hooks for operation of pre-write log (Write-ahead Log,wal). Walobserver is executed during the Wal process, with one instance per region server.
3, Masterobserver: Provides the DDL operation Hook, this kind of operation has the creation table, deletes the table, the Change table meta-information and so on. Masterobserver is executed on the HBase master.
Each class of observer can be loaded into multiple. All observer are run in chained order.
HBase community There are three main scenarios for using the coprocessor framework to build a two-level index:
1, based on Walobserver in an index table to generate the index, through the column truncation write Ahead log write operation, extract the keyvalue to write Hlog information, the corresponding information stored in the Index table.
2, based on the regionobserver in a region to maintain an index column family, by intercepting the put, delete, and other operations, extract the corresponding information stored in the same region of the index column family, such a way index is a local index. Full ordering is not supported.
3. Build an index within an index table based on Regionobserver. The corresponding information is stored in the index table by means of the put, delete and other actions of the column section.
At this point, you may want to know what the index table looks like. In general, suppose that one of the HBase tables has a column called the region, which stores the city where the user is currently located, such as Nanjing. Shanghai so. Then to create a corresponding two-level index for some, when inserting a new row of records. The value of this column is extracted from the coprocessor's hook function Observer, and a row of index records is inserted in the corresponding index table. The rowkey of this row index record is the value extracted from the column, and the Rowkey of the row index record is the real rowkey of the record that was just inserted, so that in the retrieval area, for example, the user enters a "Nanjing". Or run an hbase-based SQL statement: SELECT * from usertable where location= ' Nanjing ', it will be detected to location This column has been created with a level two index, so the index table will look for the Rowkey ' Nanjing ' 's record. Then take out the real rowkey, take the corresponding record in the corresponding data table and show it to the user.
There are many similarities between relational SQL database and NoSQL data for non-primary key indexes.
Secondary index and full-text search:
But one thing to note is that two-level indexes and full-text indexing are two different concepts:
A two-level index is usually created so that it can be queried in real time. And the purpose of full-text search is to be able to quickly find their own content in the database. The former is a way of completely matching a column of values. The latter is the need to participle the data, the result of the word "inverted index", so as to support the specific keyword to find its own part of the content (or the document/record belonging).
Full-Text Search in HBase:
So what does full-text retrieval in HBase generally do?
Prior exposure to SOLR. It is a well-encapsulated library above the lucence, which supports the full-text retrieval service of distributed cluster mode. The principle is simply to be a full-text retrieval of the contents of the field of Word segmentation, and then the results of the word segmentation to establish an inverted index.
Full-Text search is implemented in HBase. It is generally possible to combine LUCENCE/SOLR to make use of their own initiative to create an index for SOLR. Nothing more than its input source is different, it may have been generally for the relational database, now the input source into hbase.
Full-Text search is not real-time, not that you insert a record into hbase, immediately you can query in real-time in a faster time, because the background to build the index is a process. In general it may be a few minutes or so.
Full-Text search in a relational database:
After learning about HBase's full-text search, it is natural to associate full-text retrieval in a relational database.
The full-text retrieval of the relational database and HBase is almost the same as in fact.
The popular approach is also implemented using a fully-fledged open source full-text retrieval library such as SOLR. The relational database is used as the data input source for SOLR, which periodically extracts data from the relational database on its own initiative. or triggered by the Insert/delete/update operation of the relational database, to create an inverted index in SOLR for the result of the specified field's Word segmentation.
In addition, there are some of the more famous commercial full-text search engine, our company is now using this.
A fuzzy query in a relational database like:
When thinking about queries, it is very difficult to think of fuzzy queries in relational databases.
In my opinion, a fuzzy query looks like a full-text search for a field, and in fact it is more powerful than that in terms of functionality (not talking about efficiency). For example, you might use the full-text search engine for the phrase "I am a Jiyichin." Participle will be divided into "I", "yes". "Jiyichin" three words (the actual word breaker may not be the case.) I just give a sample). This time your full text search can only be searched for "I", "yes", "Jiyichin" three words, if you want to search "Yi Qin" then you will not be able to search.
But the fuzzy query in the relational database, if you use "%" to query the absolute can query.
Why is it?
I understand that the fuzzy query in the relational database does not establish the index of the whole field for the field of the fuzzy query, and then the index of the word segmentation results after the word segmentation, but after getting the query condition of "% Yi Qin", the whole table is scanned and matched to this regular table.
Even though you put a wildcard in front. such as "like", this query will certainly not go through the index, but the full table scan, but suppose your fuzzy query is "like season%". After the wildcard is placed in front, then the query is likely to go index, and the wildcard placed in front of the query in fact there are some techniques for detailed optimization. For example, look at this sample: http://blog.csdn.net/firstboy0513/article/details/6912632.
In most relational databases, it is actually possible to create full-text indexes on the specified fields (note that this is not a third-party full-text search engine), as in MySQL:
MATCH (Col1,col2,...) Against (Expr[search_modifier])
Search_modifier: {in BOOLEAN MODE | Withquery EXPANSION}
For example: SELECT * from Tab_name WHERE MATCH (col1,col2) against (Search_word);
Table in this case should be a MyISAM type. Col1, col2 must be char, varchar, or text type, you need to establish a full-text index on col1 and col2 before querying.
Here are some things to note:
MySQL in the case of high concurrent connections, the number of database records, SELECT ... WHERE ... Like '%...% ' full-Text search method is not only inefficient, and in the wildcard% and _ at the beginning of the query, the use of the index, the need for a full table scan, the pressure on the database is also very large. MySQL provides a full-text indexing solution for this issue, which not only improves performance and efficiency (because MySQL indexes these fields to refine the search). and a higher quality search is achieved. But. So far. MySQL does not correctly support Chinese full-text indexing.
---This problem can be solved by using some MySQL Chinese full-text search plugin.
For full-text search in Oracle, you can refer to this article: http://www.iteye.com/topic/1118055.
This article is personal opinion, if you have any questions or corrections, welcome to discuss, thank you!
Copyright notice: This article blog original articles, blogs, without consent, may not be reproduced.
About Sql/nosql Database search/Think query