NoSQL selection and HBase case study (GO)

Source: Internet
Author: User

We have published many NoSQL articles, covering everything from NoSQL types to commonly used products. Today we look at how well-known Internet companies and research institutions in China select and use NoSQL databases.
NoSQL rests in part on one very important principle: the CAP theorem. Traditional SQL (relational) databases provide ACID properties and strong consistency, at the cost of A (availability) and P (partition tolerance). To improve performance and scalability, a system must give up some C (consistency).


Based on the CAP theorem, database selection can go in one of three directions, depending on the needs of the application:
Favor CA: this is the traditional relational database (RDBMS).
Favor CP: these are mainly key-value databases, typified by Google's BigTable, which stores column data in sorted order. Data is distributed across multiple machines, and updates carry strict consistency guarantees.
Favor AP: these are mainly document-oriented databases built for distributed systems, typified by Amazon's Dynamo, which places data by hashing its key. Its sharding model tolerates failures well, so it settles for a relatively loose form of consistency: eventual consistency.
SurDoc is an electronic document cloud service platform developed in-house by Sursen. It supports cloud services and mobile devices, is based on the international standard UOML, and is described as the only product in the world that lets users read documents in any format without installing Office, a PDF reader, or other software. Jin Youbing, vice president and CTO of Beijing Sursen Software, described the experience of selecting and using NoSQL databases during SurDoc's development.
First, Jin Youbing said that the data model is the first thing to consider when choosing a database. SurDoc holds three kinds of data:
Documents uploaded by users, stored in a distributed file system.
Each user's own document library, stored in the relational database MySQL.
Basic information about each document, such as shard information and storage locations, stored in a NoSQL database. This data is large in volume and simple in structure, and for it SurDoc favors AP in CAP terms.
The following aspects need to be considered when choosing a NoSQL database:
Data model and operation model: is the application-layer data model row-, object-, or document-oriented? Can the system support statistical work?
Reliability: when data is updated, is the new data written to persistent storage immediately? Is it synchronized to multiple machines?
Scalability: how much data is there, and can a single machine hold it? Can a single machine handle the read and write volume?
Partitioning strategy: given the requirements for scalability, availability, or durability, does a single piece of data need to live on more than one machine? Do you need to know which machine the data is on?
Consistency: is data replicated across multiple machines, and how is consistency guaranteed among the copies?
Transactions: does the business require ACID transactions?
Single-node performance: if data is persisted on disk, is the workload read-heavy or write-heavy? Will writes become a disk bottleneck?
Load assessment: is load monitoring supported?
Early MemcacheDB Practice
Advantages:
MemcacheDB uses Berkeley DB's persistent storage and asynchronous master-slave replication to give memcached transactional recovery, persistence, and distributed replication.
Well suited to applications that need very fast reads and writes and persistent storage but do not need strict transactional constraints.
High throughput, excellent read and write speed, and compatibility with the memcached interface.
Disadvantages:
Only one-way master-slave replication is supported, so availability is poor. Horizontal scaling is difficult because of hit-ratio problems, and reads sometimes fail to return the correct data.
Redis Practice
Redis has problems similar to MemcacheDB's. It is the NoSQL database used by Sina Weibo, and its advantage is high performance.
Tokyo Tyrant Practice
Advantages:
Excellent read and write speed and concurrency, support for the memcached protocol, and high availability through dual-master replication.
Disadvantages:
Data has to be distributed by the client, which is hard to get right, as is redistributing data when nodes are removed. The distributed file system used by SurDoc is not well suited to Internet applications.
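The redistribution problem can be made concrete with a small sketch. It assumes the client picks a node with a plain hash-modulo rule, which the source does not state; `shard_for` and `moved_fraction` are hypothetical helper names. Under that assumption, going from four nodes to three changes the target node for roughly three quarters of the keys.

```python
import hashlib


def shard_for(key, node_count):
    """Client-side placement: hash the key, then take it modulo the node count."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % node_count


def moved_fraction(keys, nodes_before, nodes_after):
    """Fraction of keys whose target node changes when the cluster is resized."""
    moved = sum(
        1 for key in keys
        if shard_for(key, nodes_before) != shard_for(key, nodes_after)
    )
    return moved / len(keys)


# Hypothetical document keys, for illustration only.
sample_keys = [f"doc:{i}" for i in range(10_000)]
fraction = moved_fraction(sample_keys, nodes_before=4, nodes_after=3)
```

Every moved key has to be copied to its new node or it becomes a cache miss, which is why resizing a client-sharded cluster is expensive.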
Couchbase Practice
Advantages:
Extremely high throughput, very fast reads and writes, and support for the memcached protocol.
A three-node Couchbase cluster was measured at more than 12,000 ops.
Disadvantages:
Caching every key requires a lot of memory. While a node is down, the cluster is unavailable during failover and some data is lost. Under high load the system can also hang.
Keys should be kept as short as possible at design time. After a period in production, the failure rate proved relatively high.
Final choice: Riak, an open source implementation of Amazon Dynamo
Advantages:
A complete Dynamo implementation; data is always writable.
High availability: the design has no single point of failure. Each instance consists of a set of nodes, and from the application's point of view the instance provides the I/O capability. The nodes of one instance may sit in different data centers, so a problem in one data center does not cause data loss.
Cluster throughput increases as nodes are added.
LevelDB is used as the storage backend, so large amounts of memory are not needed.
A three-node Riak cluster was measured at about 5,000 ops.
Disadvantages:
The memcached protocol is not supported, so existing code has to be modified. There is no complete monitoring system, so one has to be built in-house.
Riak technical points:
Data placement uses consistent hashing.
Vector clocks allow multiple versions of a piece of data to coexist across replicas, which increases write availability (weak consistency is traded for high availability).
Fault tolerance: sloppy quorum, hinted handoff, Merkle trees.
Node interconnection: a gossip-based membership protocol, a node-to-node communication protocol whose purpose is decentralization.
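Two of these points, consistent hashing and vector clocks, can be illustrated with a short sketch. This is a generic illustration of the techniques, not Riak's actual implementation; the class `HashRing`, its `preference_list` method, the `vnodes` parameter, and `compare_clocks` are hypothetical names chosen for the example.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=64):
        # Each physical node owns `vnodes` points on the ring.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.sha1(value.encode("utf-8")).hexdigest(), 16)

    def preference_list(self, key, n=3):
        """Walk clockwise from the key's position and collect n distinct nodes."""
        start = bisect.bisect_right(self._points, self._hash(key))
        owners = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in owners:
                owners.append(node)
                if len(owners) == n:
                    break
        return owners


def compare_clocks(a, b):
    """Compare two vector clocks given as {actor: counter} dicts."""
    actors = set(a) | set(b)
    a_ahead = any(a.get(x, 0) > b.get(x, 0) for x in actors)
    b_ahead = any(b.get(x, 0) > a.get(x, 0) for x in actors)
    if a_ahead and b_ahead:
        return "concurrent"  # neither version supersedes the other; keep both
    if a_ahead:
        return "a descends b"
    if b_ahead:
        return "b descends a"
    return "equal"
```

With a ring like this, adding or removing a node only moves the keys adjacent to that node's points. A "concurrent" result from `compare_clocks` is the case where both versions are kept so that writes never have to be rejected.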
In addition, Riak can run MapReduce operations:
The map phase runs concurrently on each physical node.
The reduce phase runs on the node that submitted the MapReduce job.
MapReduce jobs can be written in JavaScript or Erlang.
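The division of labor between the two phases can be sketched in a few lines. This is not Riak's MapReduce API: threads stand in for physical nodes, and `NODE_DATA`, `map_phase`, `reduce_phase`, and `run_job` are hypothetical names with made-up data.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each "node" holds a few documents locally.
NODE_DATA = {
    "node-a": ["report.pdf", "notes.doc", "slides.pdf"],
    "node-b": ["paper.pdf", "memo.doc"],
    "node-c": ["budget.xls"],
}


def map_phase(documents):
    """Runs next to the data: count file extensions held by one node."""
    return Counter(name.rsplit(".", 1)[-1] for name in documents)


def reduce_phase(partials):
    """Runs once on the coordinating node: merge the partial counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


def run_job(node_data):
    # Threads stand in for physical nodes working in parallel.
    with ThreadPoolExecutor(max_workers=len(node_data)) as pool:
        partials = list(pool.map(map_phase, node_data.values()))
    return reduce_phase(partials)
```

Only the small partial results travel to the coordinating node, which is the point of running the map phase where the data lives.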
Given SurDoc's data model and application scenarios, the NoSQL databases Jin Youbing described mostly favor AP in CAP terms. As mentioned earlier, there is another class of NoSQL database, typified by BigTable. Lin Shumin, a data scientist in the big data research group at Renren Games, introduced BigTable's open source implementation: HBase.
First, going by the official definition, HBase is not a database in the traditional sense:
Apache HBase™ is the Hadoop database, a distributed, scalable, big data store.
HBase is an open source implementation of BigTable. It uses many terms of its own in place of the design ideas and concepts in Google's paper, so to understand how HBase works you need to trace those terms back to their source in the original text. Well-known correspondences include tablet (region), tablet server (RegionServer), GFS (HDFS), Chubby (ZooKeeper), SSTable (HFile), and memtable (MemStore).


Within the Hadoop ecosystem, HBase sits in the layer above HDFS.


HBase's overall architecture divides roughly into three layers. At the top is the client that accesses HBase; in the middle is the RegionServer layer, which manages the bottom layer, a distributed file system that is HDFS in the standard configuration. The three layers are decoupled, a point many people misunderstand.
Row key design is one of the most important aspects of using HBase. Lin Shumin went on to show how row key design affects system performance by comparing NoSQL and relational approaches.
Example 1: Sparse matrix structure
For example, China's administrative divisions form a typical tree: under China are municipalities such as Beijing and Shanghai and provinces such as Guangdong and Shandong, and under the provinces are cities. A traditional relational database would store it like this:


In HBase, the same structure can be stored like this:


The row key alone identifies a region. There are still two columns (column families in HBase terms), but the parent and child nodes are no longer stored as comma-separated lists. Because columns in HBase can be created at write time, they do not need to be defined in advance. For sparse-matrix structures like this one, the advantages of this storage method are obvious.
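A toy in-memory model may help make the layout concrete. It is not the HBase client API: `SparseTable`, `put`, `qualifiers`, and `add_region` are hypothetical names, and the family names `parent` and `child` are assumed from the description above.

```python
from collections import defaultdict


class SparseTable:
    """Toy stand-in for an HBase table: row key -> family -> qualifier -> value."""

    def __init__(self, families):
        self.families = set(families)
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, qualifier, value=""):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Qualifiers are created on write; nothing is declared in advance.
        self.rows[row_key][family][qualifier] = value

    def qualifiers(self, row_key, family):
        return sorted(self.rows.get(row_key, {}).get(family, {}))


def add_region(table, name, parent=None):
    """Store the link in both directions, one column per related region."""
    if parent is not None:
        table.put(name, "parent", parent)
        table.put(parent, "child", name)


regions = SparseTable(families=["parent", "child"])
add_region(regions, "Beijing", parent="China")
add_region(regions, "Shandong", parent="China")
add_region(regions, "Jinan", parent="Shandong")
```

A leaf such as "Jinan" simply has no `child` columns, so nothing is stored for them, whereas a fixed relational schema would carry an empty or comma-separated field.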
Example 2: Many-to-many relationships
For example, in a school's course selection system, a student can choose several courses and each course can be chosen by many students: a typical many-to-many relationship. Traditional relational database storage:


HBase Storage mode:


This design lets each record grow dynamically, which matches the real business scenario.
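A minimal sketch of the idea, using plain dictionaries rather than the HBase API. The table names, the family names (`info`, `course`, `student`), and the `enroll` helper are assumptions made for illustration; the original tables are not reproduced in the text.

```python
from collections import defaultdict

# row key -> column family -> {qualifier: value}; qualifiers appear on write.
students = defaultdict(lambda: {"info": {}, "course": {}})
courses = defaultdict(lambda: {"info": {}, "student": {}})


def enroll(student_id, course_id, role="elective"):
    """Record one enrollment on both sides, with no join table in between."""
    students[student_id]["course"][course_id] = role
    courses[course_id]["student"][student_id] = role


enroll("s001", "math101")
enroll("s001", "phys201")
enroll("s002", "math101", role="required")
```

Each student row gains one column per chosen course and each course row gains one column per enrolled student, so either side of the relationship can be read with a single row lookup.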
Example 3: Row key in place of an index
Take Renren's advertising system, which produces 1 billion records a day. How should these records be stored? One way is to write them straight to the file system, recording the user ID, the name, and the time of each action. Making them easy to query then requires building an index, which degrades performance.
In HBase it is designed like this:


The left column in the table is the row key. Its first part is the user ID, the second part is the time of the user's action, and the third part is the event ID (the ordering can be tuned at design time to favor the most common retrievals). On retrieval, the client first finds which server holds the key and then goes straight to that server to scan a very small range. Because the user IDs are spread across different servers, throughput is very high.
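The composite key and the narrow range scan can be sketched as follows. This is an in-memory illustration, not the HBase client API; the field widths, the `|` separator, and the names `make_row_key`, `SortedStore`, and `events_for_user` are assumptions.

```python
import bisect


def make_row_key(user_id, timestamp, event_id):
    """Fixed-width fields keep lexicographic order equal to logical order."""
    return f"{user_id:010d}|{timestamp:013d}|{event_id:06d}"


class SortedStore:
    """Toy stand-in for a table whose rows are kept sorted by key."""

    def __init__(self):
        self._keys = []
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start_key, stop_key):
        """Return rows with start_key <= key < stop_key, in key order."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]


def events_for_user(store, user_id, t_from, t_to):
    # The key prefix already encodes user and time, so no secondary index is needed.
    start = f"{user_id:010d}|{t_from:013d}|"
    stop = f"{user_id:010d}|{t_to:013d}|"
    return store.scan(start, stop)
```

Because rows are sorted by key, all events of one user in one time window sit next to each other, and the query touches only that slice.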
Example 4: Hotspots
Storing users' social relationships is a problem every large Internet company runs into. As in the previous example, it can be designed in HBase like this:


But this design can create hotspots. One person may have 100,000 friends, and a query that hits that person can bring down the whole server. The design can be improved by moving the contents of the friend column into the row key, so that each row records one pairwise relationship, which resolves the hotspot.
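The difference between the two layouts can be sketched with plain dictionaries. The function names and the `|` separator are hypothetical, and the prefix lookup stands in for what a real store would do with a range scan.

```python
def wide_rows(friendships):
    """One row per user; every friend is a column in that row."""
    rows = {}
    for user, friend in friendships:
        rows.setdefault(user, {})[friend] = 1
    return rows


def tall_rows(friendships):
    """One row per (user, friend) pair; the friend moves into the row key."""
    return {f"{user}|{friend}": 1 for user, friend in friendships}


def friends_of(tall, user):
    """Prefix lookup over sorted keys; a real store would use a range scan."""
    prefix = f"{user}|"
    return [key[len(prefix):] for key in sorted(tall) if key.startswith(prefix)]


# Hypothetical data: one very popular user and one ordinary user.
pairs = [("celebrity", f"fan{i:04d}") for i in range(1000)] + [("alice", "bob")]
wide = wide_rows(pairs)
tall = tall_rows(pairs)
```

In the wide layout the popular user's row holds every friend and must be served as one unit. In the tall layout the same data becomes many small rows that can be scanned in pieces and, since rows are the unit of distribution, split across servers.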
Example 5: Correlated concepts
Next is a data analysis case. Suppose a search engine collects a large volume of user search data every day. How can it find out which concepts users tend to search for together? In HBase the relationship can be captured as follows, where T(n) is a search keyword:


This kind of relationship is hard to model in a relational database. The storage advantage comes from the elasticity of columns: a column exists only for keywords that were actually searched together, which saves space compared with a traditional fixed schema.
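A small sketch of such a co-occurrence table, assuming the row key is a keyword and each column is another keyword seen in the same search. The function names and the sample searches are hypothetical.

```python
from collections import defaultdict
from itertools import permutations


def build_cooccurrence(searches):
    """Row key = keyword; columns = keywords searched together with it."""
    table = defaultdict(lambda: defaultdict(int))
    for keywords in searches:
        for a, b in permutations(sorted(set(keywords)), 2):
            table[a][b] += 1  # the column for b exists only if a and b co-occurred
    return table


def top_related(table, keyword, limit=3):
    """Most frequent co-searched keywords, ties broken alphabetically."""
    columns = table.get(keyword, {})
    return sorted(columns.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


searches = [
    ["hbase", "rowkey"],
    ["hbase", "rowkey", "region"],
    ["riak", "dynamo"],
]
cooccurrence = build_cooccurrence(searches)
```

The row for "hbase" has columns only for "rowkey" and "region". No cell is ever allocated for "riak" or "dynamo", which is the column elasticity described above.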
These cases show several points to watch in HBase:
Avoid hotspots; a distributed system's worst enemy is uneven load.
HBase splits its data. When one server holds too much data, a table of any size is split in two and one part is placed on another server, so try to keep the number of such splits as low as possible.
Treasure every byte: this saves resources, lowers risk, and reduces I/O, which helps the system perform at its best.
HBase has no foreign keys, no joins, and no cross-table transactions.
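A back-of-the-envelope sketch of why every byte matters. It relies on the fact that HBase stores the row key with every cell. The 1 billion rows per day comes from the advertising example; the cell count per row and the two key lengths are made-up illustration values.

```python
def daily_key_bytes(rows_per_day, cells_per_row, key_bytes):
    """Bytes spent on row keys per day when the key is repeated in every cell."""
    return rows_per_day * cells_per_row * key_bytes


ROWS_PER_DAY = 1_000_000_000   # from the advertising example
CELLS_PER_ROW = 5              # assumption for illustration
LONG_KEY, SHORT_KEY = 40, 24   # assumed row key lengths in bytes

saved = (daily_key_bytes(ROWS_PER_DAY, CELLS_PER_ROW, LONG_KEY)
         - daily_key_bytes(ROWS_PER_DAY, CELLS_PER_ROW, SHORT_KEY))
saved_gb = saved / 10**9       # decimal gigabytes
```

Under these assumptions, trimming 16 bytes from the key saves about 80 GB of raw key data per day, before compression and replication.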
Next, Wang Shupeng, associate researcher at the Institute of Information Engineering, Chinese Academy of Sciences (CAS), gave a talk on the development and use of a new NoSQL big data management system (BDMS). Wang Shupeng said that most of the projects he works on are not Internet applications but come from fields such as security and transportation. These industries now face big data challenges, but many popular NoSQL databases were not built for them, so his team developed its own NoSQL database management system.
Design goals
High scalability: capacity grows linearly as nodes are added.
Unified storage management for complex data types: structured, semi-structured, and unstructured data, including text and multimedia, with unified organization, management, and processing across many types of business data.
Multiple access types with standardized interfaces: retrieval, statistical analysis, correlation processing, and deep mining; combined analysis across many kinds of business data; standard DDL and DML syntax; JDBC, ODBC, and other interfaces. Retrieval, statistics, analysis, and processing have strict real-time requirements: retrieval must respond within seconds, and cross-domain retrieval and access must be supported.


The overall framework of the system includes a database management platform, structured as follows:


The management engine makes cross-data management possible, and the corresponding DDL, DML, and development interfaces can be exposed externally.
Main features of the system
Shared-nothing distributed storage and computing architecture
Organization and management of heterogeneous, multi-source data: unified storage management for structured data, unstructured text, and unstructured multimedia
Unified SQL queries over heterogeneous data: retrieval and analysis of both structured data and unstructured text can be done in SQL
Rich data access and processing modes
Efficient retrieval mechanism
Heterogeneous multi-replica storage and recovery mechanisms
Cross-domain data management and retrieval: cross-domain deployment is supported, so data centers can be built in multiple physical locations, data can be moved between them, and data in different regions can be retrieved and accessed globally
Application Scenarios
Management of massive numbers of structured records
Management and processing of large numbers of small documents
Intelligent search and mining system for heterogeneous data
Success Stories
Wang Shupeng said the system has already been applied successfully in a big data management project for a national ministry. The main requirements of that system were:
A very large volume of records: about 4 billion records (about 4 TB) generated per day;
Data is stored with backup copies, and records are retained for half a year;
Exact and fuzzy queries and statistics over the data, with results returned within seconds;
Bulk import of structured and unstructured data.
The implementation achieved the following:
A distributed storage architecture with 3 metadata nodes and 115 storage nodes;
More than 500 billion records, with query response times within seconds;
2 copies of the data retained to ensure data safety;
About 2 PB of usable capacity.
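The stated figures can be cross-checked with simple arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes) and takes "half a year" as 182 days; both are assumptions, as is the `estimate` helper name.

```python
RECORDS_PER_DAY = 4_000_000_000   # from the stated requirement
BYTES_PER_DAY = 4 * 10**12        # about 4 TB per day, decimal units assumed
RETENTION_DAYS = 182              # "half a year", assumed to be 182 days
COPIES = 2                        # from the stated results


def estimate():
    """Return (average record size in bytes, retained records, raw storage in PB)."""
    avg_record_bytes = BYTES_PER_DAY / RECORDS_PER_DAY
    retained_records = RECORDS_PER_DAY * RETENTION_DAYS
    raw_petabytes = BYTES_PER_DAY * RETENTION_DAYS * COPIES / 10**15
    return avg_record_bytes, retained_records, raw_petabytes


avg_bytes, retained, petabytes = estimate()
```

Under these assumptions the average record is about 1 KB, roughly 728 billion records are retained, and two copies need about 1.46 PB, which is consistent with the reported "more than 500 billion" records and about 2 PB of usable capacity.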
