Thoughts on SQL/NoSQL database search/query, sqlnosql

Source: Internet
Author: User
Tags solr couchdb

Thoughts on SQL/NoSQL database search/query, sqlnosql

Reprinted with the source: jiq •'s technical Blog

Hbase features:

Recently, we have been studying Hbase. Hbase has created an index based on Xingjian, and the query speed is very fast and completely real-time.

However, if Hbase needs to query data based on fields other than Jian, it can only be a full scan, which is basically unacceptable.

Therefore, Hbase is designed for specific application scenarios, and uses the real-time performance of health-based queries to achieve real-time query of Hbase data.

 

Real-Time query of relational databases based on index fields:

Then I think of the relational SQL database. They built a B/B +/B-tree index for the primary key. The query based on the primary key is real-time, and the range scan is also real-time.

More importantly, based on other fields with non-primary keys, relational databases can easily create indexes for them to achieve real-time query.

 

Hbase secondary index:

So can NOSQL databases like Hbase also create indexes for fields other than Xingjian, so as to query these fields to achieve real-time results?

The answer is yes.

In Hbase, index creation for fields other than health is called "secondary index ".

 

Generally, secondary indexes are created using the "coprocessor" in Hbase"The coprocessor mainly includes the Observer and Endpoint modes. The former is similar to the trigger in a relational database, and the latter is similar to the stored procedure in a relational database, the Observer can be used to embed user code into the existing running process. When a specific time occurs, the corresponding user code is triggered, that is, the callback function. There are currently three Observer interfaces:

1. RegionObserver: Provides hooks for data operation events, including Get, Put, Delete, and Scan. Each region has a RegionObserver instance.

2. WALObserver: Provides hooks for write-ahead log and WAL operations. WALObserver runs during WAL processing. Each region server has an instance.

3. MasterObserver: Provides hooks for DDL operations, including creating tables, deleting tables, and modifying table metadata. MasterObserver runs on HBase master.

Each type of observer can be loaded multiple times, and all observers are executed in chained order.

 

There are three solutions for using the Coprocessor framework to create secondary indexes in the Hbase community:

1. An index is generated in an index table based on WALObserver. The Write Ahead Log Write operation is intercepted in the column to extract the KeyValue pair information written into the HLog and store the corresponding information in the index table.

2. Maintain an index column family in a Region based on RegionObserver. By blocking put and delete operations of Region, extract the relevant information and store it in the same Region index column family, in this way, the index is a local index and does not support full sorting.

3. An index is generated in an index table based on RegionObserver. The put and delete operations of Region are intercepted in the column to extract the corresponding information and store it in the index table.

In this case, you may want to know what the index table looks like. Generally, if a column in an Hbase table is called the region, it stores the current city of the user, such as Nanjing, in this case, we need to create corresponding secondary indexes for some data. When a new row of records is inserted, in the coprocessor hook function Observer, the value of this column is extracted, and a row of index records is inserted in the corresponding index table, the rowkey of this row of index records is the value extracted from this column, and the rowkey of this row of index records is the real rowkey of the record just inserted, so that in the retrieval area, for example, if you enter a "Nanjing" or execute an Hbase-based SQL statement: select * from userTable where location = 'nanjing, the second-level index has been created for the location column, so the record with rowkey 'nanjing 'is searched in the index table, and the real rowkey is retrieved, extract the corresponding records from the corresponding data table and display them to the user.

 

There are still many similarities between the establishment of non-primary key indexes for relational SQL databases and NoSQL data.

 

Secondary index and full-text search:

However, note that secondary indexes and full-text retrieval are two different concepts:

A secondary index is generally created for real-time queries. The purpose of full-text search is to quickly find the content you are concerned about in the database. The former is a method that completely matches the values of a column, and the latter is to split the data and create an 'inverted Index' for the results of word segmentation ', this allows you to search for the content (or the document/record) of a specific keyword ).

 

Full-text search in Hbase:

How does Hbase perform full-text search?

Solr is a encapsulated library on Lucence that supports full-text retrieval in Distributed cluster mode. The principle is to simply perform word segmentation for the content of the full-text search field, and then create inverted indexes for the results after word segmentation.

To implement full-text search in Hbase, it can be done by using Lucence/Solr and their automatic index creation function. For Solr, the input source is different, previously, it may be intended for relational databases. Now the input source is changed to Hbase.

 

Full-text retrieval is not real-time. It does not mean that you insert a record into Hbase. You can immediately query the record in real time within a fast period of time, because index creation in the background requires a process, generally, it may take about a few minutes.

 

Full-text search in relational databases:

After learning about full-text retrieval of Hbase, we naturally think of full-text retrieval in relational databases.

Full-text retrieval of relational databases is almost the same as that of Hbase.

The popular method is to combine it with a mature open-source full-text search Library such as Solr to use relational databases as Solr data input sources. They will automatically extract data from relational databases on a regular basis, or, it is triggered by the insert, delete, or update operation of the relational database to create an inverted index for the result of the specified field segmentation in Solr.

In addition, there are some well-known commercial full-text search engines in China, which our company currently uses.

 

Like:

When thinking about queries, it is difficult to think of fuzzy queries in relational databases.

In my opinion, fuzzy search seems to be similar to full-text search for a certain field. In fact, fuzzy search is more powerful than full-text search in terms of functions (not to mention efficiency, for example, if you use the full-text search engine for the phrase "I am Ji yiqin", you may divide the word segmentation into "I" and "yes ", the word "Ji yiqin" (the actual word divider may not be like this, I just give an example). At this time, you can only search for "I" in full text ", the words "yes" and "Ji yiqin" cannot be searched if you want to search for "Yi Qin.

However, for Fuzzy queries in relational databases, you can use "% yiqin" to query them.

 

Why?

I have learned that fuzzy queries in relational databases have not created an index for the entire field for the field to be fuzzy queried, and have no word segmentation before creating an index for the word segmentation result, after obtaining the query condition of "% Yi Qin,Full table ScanTo match the regular expression.

 

Although you put the wildcard at the front, such as "like % yiqin", this query will certainly not go through the index, but will scan the entire table, however, if your fuzzy query is "like quarter %" and the wildcard is placed behind it, this query may be indexed, in addition, there are some tips for optimizing the query that includes wildcards. For example, let's look at this example: http://blog.csdn.net/firstboy0513/article/details/6912632.

 

In most relational databases, full-text indexes can be created for specified fields (note that the third-party full-text search engine is not used here), such as in MYSQL:

MATCH (col1, col2,...) AGAINST (expr [search_modifier])

Search_modifier: {in boolean mode | withquery expansion}

Example: SELECT * FROM tab_name where match (col1, col2) AGAINST (search_word );

Here, tables must be MyISAM tables, and col1 and col2 must be char, varchar, or text. Before querying, you must create a full-text index on col1 and col2.

 

Note the following:

When MySQL has high concurrent connections and a large number of database records, SELECT... WHERE... LIKE '%... % 'the full-text search method is not only inefficient, but also cannot be used for queries starting with the wildcard "%" and "_". A full table scan is required, which puts a lot of pressure on the database. MySQL provides a full-text index solution to solve this problem, which not only improves performance and efficiency (because MySQL indexes these fields to Optimize search ), in addition, it achieves higher quality search. However, up to now, MySQL does not support full-text Chinese indexing correctly.

--- This problem can be solved by using some MYSQL full-text search plug-ins.

 

For full-text search in Oracle, refer to this article: http://www.iteye.com/topic/1118055.

 

This article is my personal opinion. If you have any questions or correct me, you are welcome to discuss it. Thank you!


What are nosql databases?

1. CouchDB

Language used: Erlang
Features: DB consistency, easy to use
License: Apache
Protocol: HTTP/REST
Bidirectional data replication,
Continuous or temporary handling,
Conflict check during processing,
Therefore, master-master replication is used (see note 2)
MVCC-write operations do not block read Operations
Versions earlier than files can be saved
Crash-only (reliable) Design
Data Compression from time to time
View: Embedded ing/reduction
Format view: list display
Supports server-side document verification
Authentication supported
Real-time update based on changes
Support attachment processing
Therefore, CouchApps (independent js applications)
JQuery library required

Best Application Scenario: suitable for applications with less data changes, which execute pre-defined queries and perform data statistics. Applicable to applications that require data version support.

For example, CRM and CMS systems. Master-master replication is very useful for multi-site deployment.

(Note 2: master-master replication: a database synchronization method that allows data to be shared among a group of computers and can be updated by any group of members in the group .)

2. Redis

Language used: C/C ++
Features: Fast Running exceptions
License: BSD
Protocol: Telnet-like
Memory databases supported by hard disk storage,
However, data can be exchanged to the hard disk after version 2.0 (Note: Versions later than version 2.4 do not support this feature !)
Master-slave replication (see appendix 3)
Although simple data or hash tables indexed by key value are used, complex operations such as ZREVRANGEBYSCORE are also supported.
INCR & co (suitable for calculating limit values or statistical data)
Sets are supported (union, diff, and inter are also supported)
Support List (queue and blocking pop operations are also supported)
Support for hash tables (objects with multiple domains)
Supports sorting sets (high-score tables, applicable to range queries)
Redis supports transactions
Supports setting data to expired data (similar to the fast Buffer Design)
Pub/Sub allows users to Implement Message mechanisms

Best Application Scenario: suitable for applications with fast data changes and the database size (suitable for memory capacity.

For example, stock price, data analysis, real-time data collection, and real-time communication.

(Note 3: Master-slave replication: if only one server processes all replication requests at the same time, this is called
Master-slave replication is usually applied to Server clusters that require high availability .)

3. MongoDB

Language used: C ++
Features: it retains some user-friendly features (queries and indexes) of SQL ).
License: AGPL (initiator: Apache)
Protocol: Custom, binary (BSON)
Master/slave replication (Supports automatic error recovery and sets replication)
Built-in sharding Mechanism
Support for javascript expression Query
Attackers can execute arbitrary javascript Functions on the server.
Update-in-place support is better than CouchDB
Use memory-to-file ing for data storage
The focus on performance exceeds the functional requirements
We recommend that you enable the log function (parameter-journal)
On a 32-bit operating system, the database size is limited to approximately 2.5 Gb.
Empty databases account for about 192 Mb
Use GridFS to store big data or metadata (not a real File System)

Best application scenarios: Suitable for dynamic queries, indexes instead of map/reduce functions, performance requirements for large databases, and use
Applications in CouchDB that occupy full memory due to frequent data changes.

For example, you are planning to use MySQL or PostgreSQL, but the pre-defined columns that come with them will block you.

4. Riak

Languages used: Erlang and C, and a... the remaining full text>

How does nosql work? In relational databases, you can use select statements to query data. But how can we use this in nosql? Can we only store key-value pairs?

NoSQL databases have many different implementation methods. There are similar SQL queries and pure key-value pairs.
For K-V-type database, a typical is Redis, the system provides get, set and other commands for addition, deletion, modification, query. The key is how to design the key and value of the key-value pair.

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.