SequoiaDB notes

In the past few days I flipped through the SequoiaDB code and took some notes. The correctness of what follows is not guaranteed (there are bound to be mistakes).

Personal view: advantages
  1. The code is good and the design is concise.
  2. The use of EDUs and CBs makes the whole system much simpler, and the code can concentrate on logic.
  3. It was designed as a distributed system from the start, even if it is small and rough.
  4. It does not patch together a mess of other people's code; it is basically their own code (SQL is supported, but that can be considered as supported by PostgreSQL).
Gossip
  1. About the name: Sequoia means the redwood tree. Toyota also has a car called Sequoia, and there is the famous Sequoia Capital; I have no idea whether there is any connection.
  2. I glanced at their recruitment list and it does not include any database kernel developers. I don't know whether the database team is already big enough or they simply don't plan to hire more.
  3. Many abbreviations in the code are hard to decode. For example, the bar folder contains barBackup and barRestore: is bar = backup + restore, or is it barrier? ixm = index manager? dps = Data Protection Service? pd = Problem Determination? Also, the parameter of _extentInsertRecord is deletedRecordPtr, while the parameter of _extentRemoveRecord is recordPtr.
  4. There are very few comments. The ones that exist mostly just repeat the class name and add nothing, and no TODOs can be found. Worse, the comments themselves use abbreviations. It cannot be ruled out that the open-source release stripped the comments; the code looks a little too clean.
Summary

For an overall introduction, look at the official documents (they look a little rough and seem to have been written by different people at different times; you cannot expect too much from a startup). If you only want a one-page summary, refer to this PPT.

SequoiaDB is a new enterprise-grade distributed non-relational database. It helps enterprise users reduce IT costs and provides a solid, reliable, efficient, and flexible underlying platform for storing and analyzing big data.

Advantages
• Through unstructured storage and distributed processing, it provides near-linear horizontal scalability, so that the underlying storage is no longer a bottleneck
• Provides high availability down to the partition level, guarding against server failures, machine-room failures, and human error, and keeping data online 24x7
• Provides comprehensive enterprise-grade features, letting users easily manage highly concurrent workloads and massive data analysis
• An enhanced non-relational data model helps enterprises quickly develop and deploy applications and meet application needs on demand
• Provides eventual-consistency guarantees to fundamentally eliminate data loss
• Combines online applications with a background big-data-analysis database: through the read/write splitting mechanism, data analysis and online transactions in the same system do not interfere with each other

System Architecture

SequoiaDB uses a distributed architecture. The following is a general overview of that architecture.

On the client (application) side, local or remote applications are linked with the SequoiaDB client library. Both local and remote clients use the TCP/IP protocol to communicate with the coordination node.
The coordination node does not store any user data. It acts purely as a request router, dispatching user requests to the appropriate data nodes. The cataloguing node stores the system's metadata; the coordination node consults it to learn how data is actually distributed across the data nodes.
Cataloguing nodes can themselves form a replication group (cluster).

Data nodes store the user's data. One or more data nodes form a replication group (also called a partition group). Each data node in a replication group stores a complete copy of that group's data, also known as a replication-group instance (or partition-group instance); data nodes within a replication group use eventual consistency to synchronize their data. Data stored in different replication groups is not duplicated.
Each replication group can contain one or more data nodes. When there are multiple data nodes, data is replicated asynchronously between them. A replication group has at most one master node and several slave nodes; the master handles read and write operations, while the slaves handle read-only operations.

When a slave node goes offline, the master node continues to operate normally. When the master node goes offline, a new master is automatically elected from the slaves to handle write requests.

After a node recovers, or a new node joins the replication group, its data is automatically synchronized so that it is consistent with the master node once synchronization completes.

The architecture of a single data node is as follows:

On a data node, activity is driven by engine scheduling units (EDUs). Each node is an operating-system process, and each EDU is a thread within that node. External user requests are handled by agent threads; requests from inside the cluster are handled by replication agent threads for intra-partition synchronization events, or by shard agent threads for cross-partition synchronization events.
All write operations on data are recorded in the log buffer and written asynchronously to disk by the log writer. User data is written by the agent threads directly into the file-system buffer pool and then asynchronously flushed to the underlying disk by the operating system.
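
As a purely illustrative sketch of that write path (none of these names are SequoiaDB's; it only shows the pattern of an in-memory log buffer drained by a dedicated background writer thread):

   #include <condition_variable>
   #include <cstdio>
   #include <mutex>
   #include <string>
   #include <thread>
   #include <vector>

   // Illustrative only: agent threads append log records to a buffer; a
   // dedicated log-writer thread flushes them to disk in the background,
   // so the threads serving user requests never wait on the disk.
   class LogBuffer
   {
   public:
      explicit LogBuffer( const char *path )
      : _fp( std::fopen( path, "ab" ) ), _stop( false ),
        _writer( &LogBuffer::flushLoop, this ) {}

      ~LogBuffer()
      {
         { std::lock_guard<std::mutex> g( _mtx ) ; _stop = true ; }
         _cv.notify_one() ;
         _writer.join() ;
         if ( _fp ) std::fclose( _fp ) ;
      }

      // called on every write operation; only touches memory
      void append( const std::string &record )
      {
         std::lock_guard<std::mutex> g( _mtx ) ;
         _pending.push_back( record ) ;
         _cv.notify_one() ;
      }

   private:
      void flushLoop()
      {
         std::unique_lock<std::mutex> lk( _mtx ) ;
         while ( !_stop || !_pending.empty() )
         {
            _cv.wait( lk, [this]{ return _stop || !_pending.empty() ; } ) ;
            std::vector<std::string> batch ;
            batch.swap( _pending ) ;                // take everything queued
            lk.unlock() ;
            for ( const std::string &r : batch )    // the only place writing disk
               if ( _fp ) std::fwrite( r.data(), 1, r.size(), _fp ) ;
            if ( _fp ) std::fflush( _fp ) ;
            lk.lock() ;
         }
      }

      std::FILE               *_fp ;
      bool                     _stop ;
      std::mutex               _mtx ;
      std::condition_variable  _cv ;
      std::vector<std::string> _pending ;
      std::thread              _writer ;
   } ;

   int main()
   {
      LogBuffer log( "sequoiadb.log" ) ;   // hypothetical log file name
      log.append( "insert {a: 1}\n" ) ;    // returns immediately
      return 0 ;                           // destructor drains the buffer
   }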

External Interface

Basically, it is not much different from MongoDB: JavaScript is used, and the interfaces look roughly similar; presumably effort went into keeping the two compatible. SQL is supported via PostgreSQL, and a REST interface is also implemented.

Data Model

SequoiaDB uses the JSON data model rather than the relational data model.
JSON stands for JavaScript Object Notation. It is a lightweight data-interchange format that is easy for humans to read and write and easy for machines to generate and parse. The underlying storage format is BSON, the same as MongoDB.
The limit for a single document is still 16 MB.

Storage Model

The concrete implementation is the classic file / data segment / data page structure.
A record can span pages (the maximum page size is 64 KB) but cannot span segments.
I don't know whether this copies MongoDB or is simply a deliberate simplification: the underlying layer is really mmap, i.e. reading and paging rely on the operating system itself. The improvement over MongoDB is that there are background tasks that flush dirty pages to disk.
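
For illustration, here is a minimal Linux-only sketch of what "the underlying layer is mmap" means. It assumes a hypothetical data file data.1 of at least a few pages exists, and msync stands in for the background dirty-page flusher mentioned above:

   #include <fcntl.h>
   #include <sys/mman.h>
   #include <sys/stat.h>
   #include <unistd.h>
   #include <cstring>

   // Hypothetical sketch: map a data file and address it as pages.
   // Reads fault pages in on demand; modified pages become dirty in the
   // OS page cache and are written back by msync() or a background task.
   static const size_t PAGE_SIZE = 64 * 1024 ;   // 64 KB pages, as in the text

   int main()
   {
      int fd = open( "data.1", O_RDWR ) ;
      if ( fd < 0 ) return 1 ;

      struct stat st ;
      fstat( fd, &st ) ;

      char *base = (char *)mmap( NULL, st.st_size, PROT_READ | PROT_WRITE,
                                 MAP_SHARED, fd, 0 ) ;
      if ( base == MAP_FAILED ) return 1 ;

      // "Read page 3": just a pointer into the mapping; the kernel pages it in.
      char *page3 = base + 3 * PAGE_SIZE ;

      // "Modify a record": write through the mapping; the page becomes dirty.
      memcpy( page3, "hello", 5 ) ;

      // Explicit flush of dirty pages (the improvement over MongoDB described
      // above: a dedicated task does this instead of leaving it all to the OS).
      msync( base, st.st_size, MS_ASYNC ) ;

      munmap( base, st.st_size ) ;
      close( fd ) ;
      return 0 ;
   }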

Consistency and Durability

Transaction Model

SequoiaDB is ACID-compliant and supports transaction operations such as commit and rollback. To preserve performance, transactions are off by default; a parameter can be specified when starting the database to enable them.
When transaction support is disabled, a single operation can still write one or more fields, including updates to multiple sub-documents and array elements. SequoiaDB's ACID guarantees ensure full isolation for such a document update: any error causes the operation to roll back, and clients get a consistent view of the document.
When transactions are enabled, every operation between the transaction's start and its commit (or rollback) writes a transaction log on the data node and is tracked by transaction ID. A changed record holds a mutex until the transaction commits (or rolls back) and cannot be modified by other sessions. The current isolation level of SequoiaDB transactions is UR (uncommitted read).
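
A toy sketch of the "hold the record's lock until commit or rollback" behaviour described above (purely illustrative; none of these classes or names exist in SequoiaDB):

   #include <map>
   #include <mutex>
   #include <set>
   #include <string>

   // Illustrative only: each record has a lock; a transaction keeps every lock
   // it acquired until commit() or rollback(), so other sessions cannot change
   // those records in the meantime.  Readers are not blocked, which matches
   // the read-uncommitted (UR) isolation level described in the text.
   struct RecordStore
   {
      std::map<std::string, std::mutex>  locks ;   // one lock per record id
      std::map<std::string, std::string> data ;
   } ;

   class Transaction
   {
   public:
      explicit Transaction( RecordStore &s ) : _store( s ) {}

      void update( const std::string &id, const std::string &value )
      {
         if ( !_held.count( id ) )
         {
            _store.locks[id].lock() ;     // taken on first touch, released at end
            _held.insert( id ) ;
            _undo[id] = _store.data[id] ; // remember old value for rollback
         }
         _store.data[id] = value ;        // visible to readers before commit (UR)
      }

      void commit()   { release() ; }
      void rollback()
      {
         for ( auto &kv : _undo ) _store.data[kv.first] = kv.second ;
         release() ;
      }

   private:
      void release()
      {
         for ( const std::string &id : _held ) _store.locks[id].unlock() ;
         _held.clear() ;
         _undo.clear() ;
      }

      RecordStore                        &_store ;
      std::set<std::string>               _held ;
      std::map<std::string, std::string>  _undo ;
   } ;

   int main()
   {
      RecordStore store ;
      Transaction t1( store ) ;
      t1.update( "r1", "v1" ) ;   // other sessions now cannot update r1
      t1.commit() ;               // lock on r1 released here
      return 0 ;
   }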

Consistency

SequoiaDB adopts a collection-level, configurable consistency policy, letting you decide, per collection, whether a modification waits for confirmation from the slave nodes.
For example, for a collection with high performance requirements and modest data-reliability requirements, you can set the write-concern parameter to 1 when creating it, meaning success is returned as soon as the data is written on the master node; the operation is then propagated to the slave nodes asynchronously in the background.
For a collection with lower performance requirements but high data-reliability requirements, you can set the write-concern parameter to 3 when creating it, meaning the operation only returns after at least three nodes have confirmed it.
When the write-concern value equals the number of nodes in the collection's replica group, the system can be considered strongly consistent; otherwise, the data is eventually consistent.
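
The write-concern rule can be summarized in a few lines (illustrative names only, not SequoiaDB's API): the write is acknowledged once the configured number of nodes, counting the master itself, have confirmed it, and the remaining replicas catch up asynchronously.

   #include <iostream>

   // Illustrative only: decide when a write may be acknowledged to the client.
   // writeConcern == 1  -> the master alone is enough (fast, eventually consistent)
   // writeConcern == N  -> wait for N nodes; if N equals the replica-group size,
   //                       the group behaves as strongly consistent.
   struct ReplicaGroup
   {
      int nodes ;          // total nodes in the replication group
      int writeConcern ;   // per-collection setting from the text
   } ;

   bool canAckWrite( const ReplicaGroup &g, int acksReceived )
   {
      return acksReceived >= g.writeConcern ;
   }

   int main()
   {
      ReplicaGroup fast = { 3, 1 } ;   // ack after the master writes
      ReplicaGroup safe = { 3, 3 } ;   // ack only after all three nodes confirm

      std::cout << canAckWrite( fast, 1 ) << "\n" ;   // 1: client already unblocked
      std::cout << canAckWrite( safe, 1 ) << "\n" ;   // 0: still waiting for slaves
      std::cout << canAckWrite( safe, 3 ) << "\n" ;   // 1: strongly consistent case
      return 0 ;
   }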

Comparison with MongoDB

Overall, SequoiaDB is a refined version of MongoDB. It corrects or strengthens many of MongoDB's weak points, such as the notorious lock, page management, and the limited transaction support. This is the so-called late-mover advantage: once someone has already pointed the problems out, they are easier to solve.
As for the code quality, distributed architecture, and durability of the two, opinions will differ.
We shall see.
What I personally understand is:

  1. Row-level locking is implemented, which greatly improves insert performance compared with MongoDB; some test reports also confirm this.
  2. Although it still uses mmap, some more refined management appears to have been layered on top.
  3. It has the most basic transaction support.
  4. It does not build complex things such as a memory pool or cache pool, and puts its limited energy where it is most needed. (I think relying on the system implementation is always better than a messy home-grown copy; kernel writers are top programmers, even if their code is not tuned specifically for databases.)
VS FounderXML

The comparison here ignores the huge differences between XML and JSON.
Compared with FounderXML, SequoiaDB gave up on, or does not support, a number of things:

  1. Transactions. Rather than pulling in PostgreSQL or another complicated transactional database/engine, SequoiaDB only supports read uncommitted. Of course, one could develop a full engine in-house; well, let's forget such an ambitious idea.
  2. Support for large files. Is it needed? This is a complicated trade-off: it does add complexity to the system and drags performance down. Unless there is a real need, from a product point of view it is better left unsupported.
    Compared with FounderXML, SequoiaDB mainly comes out ahead in the following ways:
  3. It quickly completed a product with a simplified design (at the cost of sacrificing certain features).
  4. Its dependencies are small, and the core code is basically self-developed; it only depends on boost, php, a parser, crypto, and a few other things. This suddenly reminds me of a buddy who said a certain database was written by chemistry majors; the fewer the dependencies, the further it is from chemistry.
Other code-related notes

CB

CB stands for control block. It is an important interface in SequoiaDB: each component's control logic hangs off its own CB.

   /*
      _IControlBlock define
   */
   class _IControlBlock : public SDBObject, public _ISDBRoot
   {
      public:
         _IControlBlock () {}
         virtual ~_IControlBlock () {}

         virtual SDB_CB_TYPE cbType() const = 0 ;
         virtual const CHAR* cbName() const = 0 ;

         virtual INT32  init () = 0 ;
         virtual INT32  active () = 0 ;
         virtual INT32  deactive () = 0 ;
         virtual INT32  fini () = 0 ;
         virtual void   onConfigChange() {}
   } ;
   typedef _IControlBlock IControlBlock ;
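
To see how the interface is meant to be used, here is a hedged, self-contained sketch of a dummy CB. The typedefs and the SDB_CB_TYPE values below are stand-ins added so it compiles on its own; only the _IControlBlock declaration itself comes from the source. The engine can then drive every subsystem uniformly through init/active/deactive/fini.

   #include <iostream>

   // Stand-in typedefs so the sketch compiles on its own; in the real code
   // these come from the SequoiaDB headers.
   typedef char  CHAR ;
   typedef int   INT32 ;
   enum SDB_CB_TYPE { SDB_CB_DPS, SDB_CB_DMS } ;   // hypothetical values
   class SDBObject {} ;
   class _ISDBRoot { public : virtual ~_ISDBRoot () {} } ;

   class _IControlBlock : public SDBObject, public _ISDBRoot
   {
      public :
         virtual ~_IControlBlock () {}
         virtual SDB_CB_TYPE cbType () const = 0 ;
         virtual const CHAR* cbName () const = 0 ;
         virtual INT32 init () = 0 ;
         virtual INT32 active () = 0 ;
         virtual INT32 deactive () = 0 ;
         virtual INT32 fini () = 0 ;
         virtual void  onConfigChange () {}
   } ;

   // A hypothetical concrete CB: the engine only ever sees the interface, so
   // starting a node is "register the CBs for this role, then init/active each".
   class _myDummyCB : public _IControlBlock
   {
      public :
         virtual SDB_CB_TYPE cbType () const { return SDB_CB_DMS ; }
         virtual const CHAR* cbName () const { return "DUMMY" ; }
         virtual INT32 init ()     { std::cout << "init\n" ;     return 0 ; }
         virtual INT32 active ()   { std::cout << "active\n" ;   return 0 ; }
         virtual INT32 deactive () { std::cout << "deactive\n" ; return 0 ; }
         virtual INT32 fini ()     { std::cout << "fini\n" ;     return 0 ; }
   } ;

   int main ()
   {
      _myDummyCB cb ;
      _IControlBlock *p = &cb ;                              // drive it through the interface
      p->init () ; p->active () ; p->deactive () ; p->fini () ;
      return 0 ;
   }
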
Code entry

The entry point of the whole database is in pmdMain.cpp, which basically reads the configuration, initializes a bunch of managers, and recovers data left over from the last failure. The key code registers different control blocks (CBs) according to the startup role:

   void _pmdController::registerCB ( SDB_ROLE dbrole )
   {
      if ( SDB_ROLE_DATA == dbrole )
      {
         PMD_REGISTER_CB ( sdbGetDPSCB () ) ;        // DPS
         PMD_REGISTER_CB ( sdbGetTransCB () ) ;      // TRANS
         PMD_REGISTER_CB ( sdbGetClsCB () ) ;        // CLS
         PMD_REGISTER_CB ( sdbGetBPSCB () ) ;        // BPS
      }
      else if ( SDB_ROLE_COORD == dbrole )
      {
         PMD_REGISTER_CB ( sdbGetTransCB () ) ;      // TRANS
         PMD_REGISTER_CB ( sdbGetCoordCB () ) ;      // COORD
         PMD_REGISTER_CB ( sdbGetFMPCB () ) ;        // FMP
      }
      else if ( SDB_ROLE_CATALOG == dbrole )
      {
         PMD_REGISTER_CB ( sdbGetDPSCB () ) ;        // DPS
         PMD_REGISTER_CB ( sdbGetTransCB () ) ;      // TRANS
         PMD_REGISTER_CB ( sdbGetClsCB () ) ;        // CLS
         PMD_REGISTER_CB ( sdbGetCatalogueCB () ) ;  // CATALOGUE
         PMD_REGISTER_CB ( sdbGetBPSCB () ) ;        // BPS
         PMD_REGISTER_CB ( sdbGetAuthCB () ) ;       // AUTH
      }
      else if ( SDB_ROLE_STANDALONE == dbrole )
      {
         PMD_REGISTER_CB ( sdbGetDPSCB () ) ;        // DPS
         PMD_REGISTER_CB ( sdbGetTransCB () ) ;      // TRANS
         PMD_REGISTER_CB ( sdbGetBPSCB () ) ;        // BPS
      }
      else if ( SDB_ROLE_OM == dbrole )
      {
         PMD_REGISTER_CB ( sdbGetDPSCB () ) ;        // DPS
         PMD_REGISTER_CB ( sdbGetTransCB () ) ;      // TRANS
         PMD_REGISTER_CB ( sdbGetBPSCB () ) ;        // BPS
         PMD_REGISTER_CB ( sdbGetAuthCB () ) ;       // AUTH
         PMD_REGISTER_CB ( sdbGetOMManager () ) ;    // OMSVC
      }

      // data management service control block: the metadata for the DMS
      // component, including collection spaces
      PMD_REGISTER_CB ( sdbGetDMSCB () ) ;           // DMS
      // run-time: creates and deletes contexts
      PMD_REGISTER_CB ( sdbGetRTNCB () ) ;           // RTN
      // SQL
      PMD_REGISTER_CB ( sdbGetSQLCB () ) ;           // SQL
      // aggregation
      PMD_REGISTER_CB ( sdbGetAggrCB () ) ;          // AGGR
      // starts the svc/rest servers and manages session info
      PMD_REGISTER_CB ( sdbGetPMDController () ) ;   // CONTROLLER
   }
Scheduling unit (EDU)

Different EDUs have different entry functions. For example, listening on a port is started with

      rc = pEDUMgr->startEDU( EDU_TYPE_TCPLISTENER, (void*)_pTcpListener,
                              &eduID ) ;

startEDU is defined as:

   // Start a thread of the given type (the real function is more complex than
   // this), pass the argument in, and return the eduID
   INT32 _pmdEDUMgr::startEDU ( EDU_TYPES type, void *arg, EDUID *eduid )

The system then starts a thread that runs the pmdTcpListenerEntryPoint function. The mapping from EDU type to entry function is the following table:

   static const _eduEntryInfo entry[] = {
      ON_EDUTYPE_TO_ENTRY1 ( EDU_TYPE_COORDAGENT, FALSE, pmdAgentEntryPoint, "CoordAgent" ),
      // the agent entry points eventually call _pmdDataProcessor::processMsg,
      // which is a handy place to see the list of all commands
      ON_EDUTYPE_TO_ENTRY1 ( EDU_TYPE_AGENT, FALSE, pmdLocalAgentEntryPoint, "Agent" ),
      ON_EDUTYPE_TO_ENTRY1 ( EDU_TYPE_RESTAGENT, FALSE, pmdRestAgentEntryPoint, "RestAgent" ),
      // port listening
      ON_EDUTYPE_TO_ENTRY1 ( EDU_TYPE_TCPLISTENER, TRUE, pmdTcpListenerEntryPoint, "TCPListener" ),
      // rest listening
      ON_EDUTYPE_TO_ENTRY1 ( EDU_TYPE_RESTLISTENER, TRUE, pmdRestSvcEntryPoint, "RestListener" ),
      ON_EDUTYPE_TO_ENTRY1 ( EDU_TYPE_LOGGW, TRUE, pmdLoggWEntryPoint, "LogWriter" ),
      ON_EDUTYPE_TO_ENTRY1 ( EDU_TYPE_LOADWORKER, FALSE, pmdLoadWorkerEntryPoint, "MigLoadWork" ),
      ON_EDUTYPE_TO_ENTRY1 ( EDU_TYPE_PREFETCHER, FALSE, pmdPreLoaderEntryPoint, "PreLoader" ),
      ON_EDUTYPE_TO_ENTRY1 ( EDU_TYPE_SYNCCLOCK, TRUE, pmdSyncClockEntryPoint, "SyncClockWorker" ),
      // ... plus similar entries for ShardAgent, ReplAgent, HTTPAgent, Cluster,
      // ClusterShard, ClusterLogNotify, ReplReader, ShardReader, PipeListener,
      // Task, CatalogMC, CatalogNM, CatalogManager, CatalogNetwork, CoordNetwork,
      // DpsRollback, OMManager and OMNet
      ON_EDUTYPE_TO_ENTRY1 ( EDU_TYPE_MAXIMUM, FALSE, NULL, "Unknow" )
   } ;
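
The mechanism behind startEDU and the entry table can be pictured with a small sketch (hypothetical names; not the real pmdEDUMgr): look the type up in a table of entry functions and run that function on a new thread.

   #include <map>
   #include <thread>
   #include <vector>

   typedef int  INT32 ;
   typedef long EDUID ;
   enum EDU_TYPES { EDU_TYPE_TCPLISTENER, EDU_TYPE_AGENT } ;
   typedef INT32 (*pmdEntryPoint)( void * ) ;

   // Dummy entry functions standing in for pmdTcpListenerEntryPoint and friends.
   INT32 tcpListenerEntry( void * ) { /* accept loop would live here */ return 0 ; }
   INT32 agentEntry( void * )       { /* per-request processing here */ return 0 ; }

   // Illustrative EDU manager: map type -> entry function, run it on a thread.
   class eduMgr
   {
   public:
      eduMgr()
      {
         _entry[EDU_TYPE_TCPLISTENER] = tcpListenerEntry ;
         _entry[EDU_TYPE_AGENT]       = agentEntry ;
      }
      ~eduMgr() { for ( std::thread &t : _threads ) t.join() ; }

      INT32 startEDU( EDU_TYPES type, void *arg, EDUID *eduID )
      {
         _threads.push_back( std::thread( _entry[type], arg ) ) ;  // one EDU == one thread
         *eduID = (EDUID)_threads.size() ;
         return 0 ;
      }

   private:
      std::map<EDU_TYPES, pmdEntryPoint> _entry ;
      std::vector<std::thread>           _threads ;
   } ;

   int main()
   {
      eduMgr mgr ;
      EDUID  id = 0 ;
      mgr.startEDU( EDU_TYPE_TCPLISTENER, (void *)0, &id ) ;   // mirrors the call quoted above
      return 0 ;
   }
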
Memory Management

SDBObject is defined and many classes inherit from it. The class overrides new/delete, but underneath it still just calls malloc; it would not be surprising if this were later upgraded to a memory pool. Currently only some checks are added.
Memory used for queries appears to be allocated on the fly, so most of the system's memory should still be consumed by mmap.
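
A minimal sketch of what "overrides new/delete but still uses malloc underneath" looks like (illustrative only; the real SDBObject presumably adds its own bookkeeping). Callers keep writing plain new/delete, and the allocation policy can later be swapped for a memory pool in this one place.

   #include <cstddef>
   #include <cstdlib>
   #include <new>

   // Illustrative base class: every engine object inherits from it, so memory
   // allocation is funneled through one place.  Today it just forwards to
   // malloc/free (plus whatever checks one wants); later it could switch to a
   // memory pool without touching any caller.
   class SDBObjectSketch
   {
   public:
      void *operator new( std::size_t size )
      {
         void *p = std::malloc( size ) ;
         if ( !p ) throw std::bad_alloc() ;
         return p ;
      }
      void operator delete( void *p )
      {
         if ( p ) std::free( p ) ;
      }
   } ;

   class Cursor : public SDBObjectSketch { public : int pos ; } ;

   int main()
   {
      Cursor *c = new Cursor() ;   // goes through SDBObjectSketch::operator new
      delete c ;                   // and SDBObjectSketch::operator delete
      return 0 ;
   }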

Query execution

A basic query execution method:

  1. rtnQuery builds the query plan (query optimization seems to do something, but I have not had time to look at it) and positions the cursor at the first result.
  2. rtnGetMore continues scanning from the first result onward.
    This avoids ever materializing an over-large query result; a rough sketch of the pattern follows below.
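
The two steps amount to a server-side cursor. A rough, purely illustrative sketch (none of these names are SequoiaDB's):

   #include <cstddef>
   #include <vector>

   // Illustrative cursor: "query" builds/optimizes the plan and positions at the
   // first match, "getMore" returns results batch by batch, so a huge result set
   // is never materialized in one go.
   class QueryContext
   {
   public:
      explicit QueryContext( const std::vector<int> &table ) : _table( table ) {}

      // rtnQuery-like step: plan + locate the first result (here: a trivial scan)
      void query() { _pos = 0 ; }

      // rtnGetMore-like step: hand back at most batchSize results per call
      std::vector<int> getMore( std::size_t batchSize )
      {
         std::vector<int> batch ;
         while ( _pos < _table.size() && batch.size() < batchSize )
            batch.push_back( _table[_pos++] ) ;
         return batch ;                 // empty batch == cursor exhausted
      }

   private:
      const std::vector<int> &_table ;
      std::size_t             _pos = 0 ;
   } ;

   int main()
   {
      std::vector<int> rows( 1000, 42 ) ;
      QueryContext ctx( rows ) ;
      ctx.query() ;
      while ( !ctx.getMore( 100 ).empty() ) { /* consume one batch at a time */ }
      return 0 ;
   }
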
Operating System Encapsulation

OSS encapsulates common system calls such as read, create, open, and malloc. By the way, some documents only list supported Linux versions, but judging from the code, Windows should be supported as well.
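
The idea of such a wrapper layer can be sketched as follows (hypothetical names; the real wrappers are the oss* functions in the source): the engine calls one signature, and each platform supplies its own implementation behind it.

   #include <cstdio>

   // Hypothetical sketch of an OS-services wrapper: one signature for the
   // engine, different implementations per platform.
   #if defined(_WIN32)
   #include <windows.h>
   typedef HANDLE ossFileHandle ;

   static bool ossOpenSketch( const char *path, ossFileHandle &h )
   {
      h = CreateFileA( path, GENERIC_READ, 0, NULL,
                       OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL ) ;
      return h != INVALID_HANDLE_VALUE ;
   }
   #else
   #include <fcntl.h>
   #include <unistd.h>
   typedef int ossFileHandle ;

   static bool ossOpenSketch( const char *path, ossFileHandle &h )
   {
      h = open( path, O_RDONLY ) ;
      return h >= 0 ;
   }
   #endif

   int main()
   {
      ossFileHandle h ;
      if ( ossOpenSketch( "sdb.conf", h ) )   // hypothetical config file name
         std::printf( "opened\n" ) ;
      return 0 ;
   }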

