RAC (3): RAC Principles


1. RAC Concurrency

RAC is, in essence, still a single database, but one that runs on multiple computers. In a traditional single-instance database, whether a process may modify a piece of data depends only on whether other processes on the same machine are modifying it concurrently. In a RAC environment this check is no longer sufficient: you must also determine whether processes on the other computers are modifying the same data concurrently.

So the first problem RAC must solve is how to detect concurrency across multiple computers.

Concurrency on the local machine can still be handled by the traditional single-instance locking mechanism; detecting concurrency on other machines requires a new mechanism, the Distributed Lock Manager (DLM). Think of the DLM as an arbitrator: it records which node is using which piece of data and coordinates contention between the nodes.

Let's illustrate how the DLM works with an example:

In a two-node RAC, node 1 wants to modify data A. Node 1 sends a request to the DLM; the DLM finds that data A is not in use by any node, grants the request, and records that node 1 is now using data A. Node 2 then also wants to modify data A. Node 2 sends a request to the DLM, which finds that data A is in use by node 1. The DLM asks node 1 to hand data A over to node 2; node 1 releases its hold on data A, node 2 can now operate on data A, and the DLM records the whole exchange.
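A minimal sketch of this grant-and-handoff flow in Python. The class and method names (SimpleDLM, request, release) are hypothetical illustrations of the idea, not Oracle's actual implementation.

```python
# Conceptual sketch of the DLM grant-and-handoff flow described above.
# Names (SimpleDLM, request, release) are illustrative only; this is not
# how Oracle implements the Distributed Lock Manager.

class SimpleDLM:
    def __init__(self):
        self.owners = {}  # resource name -> node currently holding it

    def request(self, node, resource):
        holder = self.owners.get(resource)
        if holder is None:
            # No node is using the resource: grant it and register the owner.
            self.owners[resource] = node
            return f"granted {resource} to {node}"
        if holder == node:
            # The requesting node already holds it; nothing more at this level.
            return f"{node} already holds {resource}"
        # Another node holds it: ask that node to give it up first.
        self.release(holder, resource)
        self.owners[resource] = node
        return f"{holder} released {resource}; granted to {node}"

    def release(self, node, resource):
        if self.owners.get(resource) == node:
            del self.owners[resource]

dlm = SimpleDLM()
print(dlm.request("node1", "data A"))   # granted data A to node1
print(dlm.request("node2", "data A"))   # node1 released data A; granted to node2
```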

It should be emphasized that the DLM is only responsible for coordination between nodes; coordination within a node is not the DLM's job. Continuing the example above:

Suppose process 1 on node 2 is modifying data A, and process 2 on node 2 also wants to modify it. Node 2 still asks the DLM, but the DLM finds that node 2 already holds the required permission, so no new grant is needed. Process 2's request passes the DLM check, but whether process 2 may actually modify data A still has to be decided by the traditional local locking mechanism, as sketched below.
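A short sketch of this two-level check, assuming the same illustrative style as above: the DLM layer arbitrates between nodes, and an ordinary local lock arbitrates between processes inside one node. None of these names come from Oracle.

```python
# Sketch of the two-level check: the DLM arbitrates between nodes,
# while an ordinary local lock arbitrates between processes on one node.
# All names here are illustrative, not Oracle internals.
import threading

node_grants = {"data A": "node2"}          # DLM level: which node holds the block
local_locks = {"data A": threading.Lock()} # node level: traditional lock inside node 2

def modify(node, process, resource):
    # Step 1: node-level permission (DLM). If the node already holds the
    # resource, no new DLM grant is needed.
    if node_grants.get(resource) != node:
        node_grants[resource] = node       # simplified "grant"
    # Step 2: process-level permission (traditional lock inside the node).
    with local_locks[resource]:
        print(f"{node}/{process} modifies {resource}")

modify("node2", "process 1", "data A")
modify("node2", "process 2", "data A")  # passes the DLM check, still needs the local lock
```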

With the first question answered, a second one arises: what exactly is the "data A" mentioned above? In other words, at what level of resource does the DLM arbitrate conflicts? Is A a row? A data block? Or a data file? The answer is: a data block.

That is, when a process wants to modify data A, the request it sends to the DLM is for the right to operate on the data block that contains A.

Oracle's cluster product has gone through two stages: first Oracle Parallel Server (OPS), then, from 9i onward, RAC. The DLM had a different name in each stage: in OPS it was called PCM; in RAC it is called Cache Fusion. For now, all we need to remember is that the current name of the DLM is Cache Fusion.

In the DLM, resources are divided into two categories according to their quantity and how intensively they are accessed: Cache Fusion resources and Non-Cache Fusion resources.

Cache Fusion resources: data blocks, including ordinary data blocks, index blocks, segment header blocks, and undo blocks.

Non-Cache Fusion resources: everything that is not a data block, including data files, control files, data dictionary entries, the library cache, the row cache, and so on.

As a typical Non-Cache Fusion resource, consider the library cache. The library cache mainly holds SQL statements, execution plans, PL/SQL packages, and stored procedures, together with the objects these items reference. When a SQL statement is compiled, a library cache lock is taken on each object it references; when the statement is executed, a library cache pin is taken on those objects, guaranteeing that their structure does not change while the statement is running.

Notably, once compilation finishes, the library cache lock on each referenced object is converted from shared or exclusive mode to NULL mode. A NULL-mode library cache lock works like a trigger: whenever the structure or definition of the referenced object changes, for example a column is added, every SQL statement that references it is invalidated and must be recompiled. Take SELECT * FROM A: after this statement is compiled, its execution plan holds a NULL-mode library cache lock on A. When we later change the structure of A (say, add a new column), this "trigger" fires and the execution plan of SELECT * FROM A becomes invalid; the next execution of the SQL forces a recompilation.

In a RAC environment the problem extends further: every node may hold objects that reference table A, and when A's structure is modified on any one node, the corresponding objects on all other nodes must be invalidated as well. Therefore, in addition to the traditional library cache lock, the LCK0 process of each node takes a shared-mode IV (invalidation) instance lock on each object in that instance's library cache. If a user wants to modify an object's definition, an exclusive-mode IV lock must be obtained first, which tells the local LCK0 process to release its shared-mode lock. Before releasing it, the local LCK0 notifies the LCK0 processes of the other nodes; when they receive the message, they invalidate the related objects in their local library caches.
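A sketch of the invalidation idea just described: a NULL-mode lock behaves like a registered dependency, and a DDL on the object results in every instance's library cache invalidating its dependent cursors. The class and function names here are invented for illustration; the real mechanism uses IV instance locks and the LCK0/LMD processes.

```python
# Conceptual sketch: NULL-mode library cache locks as invalidation triggers,
# plus the cross-instance invalidation. Names are illustrative only.

class LibraryCache:
    def __init__(self, instance):
        self.instance = instance
        self.cursors = {}   # sql text -> {"valid": bool, "depends_on": set of objects}

    def compile(self, sql, referenced_objects):
        # Compiling a statement registers NULL-mode "locks" (dependencies).
        self.cursors[sql] = {"valid": True, "depends_on": set(referenced_objects)}

    def invalidate(self, obj):
        # The "trigger" fires: every cursor referencing obj becomes invalid.
        for sql, entry in self.cursors.items():
            if obj in entry["depends_on"]:
                entry["valid"] = False

def alter_object(obj, caches):
    # DDL on one node: the modifier needs an exclusive IV lock, which ends up
    # invalidating the dependent cursors on every instance, not just the local one.
    for cache in caches:
        cache.invalidate(obj)

caches = [LibraryCache("node1"), LibraryCache("node2")]
caches[0].compile("select * from A", ["A"])
caches[1].compile("select * from A", ["A"])
alter_object("A", caches)   # e.g. adding a column to table A
print([c.cursors["select * from A"]["valid"] for c in caches])  # [False, False]
```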

This is a broadcast mechanism, carried out by each instance's LMD process (described in detail in the next section).

The row cache holds the data dictionary; its purpose is to reduce disk access during compilation. Its contents also need to be kept synchronized across all instances. The synchronization mechanism is the same as for the library cache, again handled by the LCK0 process.

2. GRD (Global Resource Directory)

You can think of the GRD as an internal database that records how every data block is distributed across the cluster. It lives in the SGA of each instance, but each instance's SGA holds only part of the GRD; the union of all instances' portions forms the complete GRD.

Based on each resource's name, RAC chooses one node in the cluster as that resource's master node; the other nodes are called shadow nodes. The GRD on the master node records the usage of that resource on all nodes, while the GRD on a shadow node records only the resource's usage on that node. This usage information is, in effect, the PCM lock information. A PCM lock has three attributes: mode, role, and PI (past image).
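A sketch of the mastering idea: the resource name maps deterministically to one node, which then keeps the full usage record, while the other nodes keep only their own. The hash-based mapping shown here is an assumption made for illustration; Oracle's actual mastering algorithm is more involved (and can remaster resources dynamically).

```python
# Sketch: picking a master node per resource from its name, and splitting
# GRD bookkeeping between master and shadow nodes. Illustrative only.
import zlib

NODES = ["node1", "node2"]

def master_of(resource_name):
    # A stable mapping from resource name to one node of the cluster.
    return NODES[zlib.crc32(resource_name.encode()) % len(NODES)]

# Master node's GRD entry: usage on ALL nodes (the PCM lock information).
# Shadow node's GRD entry: usage of the resource on that node only.
grd = {node: {} for node in NODES}

def record_usage(resource_name, node, mode, role, past_image):
    lock_info = {"mode": mode, "role": role, "pi": past_image}
    grd[master_of(resource_name)].setdefault(resource_name, {})[node] = lock_info
    if node != master_of(resource_name):
        grd[node][resource_name] = {node: lock_info}

record_usage("block file=4 block=123", "node2", "X", "local", 0)
print(master_of("block file=4 block=123"), grd)
```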

3. PCM Lock

We saw above that PCM lock information is recorded in the GRD and that a PCM lock has three attributes: mode, role, and PI.

Let's look at each of these three attributes:

1) Mode: this attribute describes the mode of the lock and has three possible values: NULL, S (shared), and X (exclusive).

2) Role: each data block can be modified by multiple nodes, and the role attribute describes how "dirty" copies of the block are distributed across the cluster. It has two values, local and global; the meaning of each role is best explained in combination with the mode:

For the local role, the only possible modes are S and X. If the mode is S, the block in memory is identical to the block on disk; if the mode is X, the block has been modified in memory but the change has not yet been written back to disk, i.e. it is a "dirty block". For an instance holding the local role, writing the block to disk does not require contacting the GRD; the instance can complete the write on its own.

If an instance holding the block in X mode with the local role sends the block to another instance, two cases arise. If it sends a version that is consistent with the disk (the receiver gets the same version as on disk), the sending instance keeps the local role. If it sends a version that is inconsistent with the disk, its role changes to global, and the receiver's role is global as well, meaning that multiple instances now hold "dirty" versions of the block.

With the global role, the possible modes are S, X, and NULL. The global role means, first of all, that multiple instances hold versions of the block that are inconsistent with the disk. If such a block is to be written to disk, the instance must contact the GRD, and the write must be performed using the current version of the block.
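The rules above can be condensed into a small decision function: local-role holders write their block without consulting the GRD, while global-role holders must go through the GRD so that the current version is the one written. A hedged sketch; the mode and role names follow the text, everything else is made up.

```python
# Sketch of the mode/role rules for writing a block to disk. Illustrative only.

VALID_COMBINATIONS = {
    "local":  {"S", "X"},           # local role: only S or X
    "global": {"S", "X", "NULL"},   # global role: S, X or NULL
}

def write_to_disk(mode, role):
    assert mode in VALID_COMBINATIONS[role], "invalid mode/role combination"
    if role == "local":
        # S: the memory copy equals the disk copy, nothing to write.
        # X: dirty block, but this instance may write it back on its own.
        return "no GRD contact needed; the instance handles the write itself"
    # Global role: several instances hold versions inconsistent with disk,
    # so the GRD must be contacted and the CURRENT version written.
    return "contact GRD; the write must be done with the current version"

print(write_to_disk("X", "local"))
print(write_to_disk("S", "global"))
```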

3) Past Image: the following example explains what a past image (PI) is. Assume a two-node RAC cluster and a data block whose on-disk version has SCN = 100:

Here we go. Instance 1 wants to modify this block; it reads it from disk into its SGA and modifies it, so the in-memory SCN becomes 110. Instance 2 then also wants to modify the block, so instance 1 ships it to instance 2 through Cache Fusion. What is transmitted is the SCN = 110 version, i.e. the current copy of the block. Instance 1 also keeps this SCN = 110 copy in its SGA, but may no longer modify it: this retained copy on instance 1 is a past image (PI), with SCN = 110. Before sending the block, instance 1 writes the relevant log buffer contents to its redo log.

Next, instance 2 modifies the block and the SCN becomes 120; note that the version on disk is still SCN = 100. Now suppose a log switch on instance 1 triggers a checkpoint. Because the block on instance 1 is dirty (though not the dirtiest; the SCN = 120 copy on instance 2 is the most recent dirty version), it must be synchronized to disk. Instance 1 consults the GRD, finds that instance 2 owns the current version of the block, and the GRD asks instance 2 to write the block to disk. When instance 2 finishes the write, it notifies the other instances (all instances holding a PI version) to release their PI buffers. At that point instance 1 records a BWR (block written record) in its log buffer and then releases the PI buffer.
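The same timeline replayed as a small simulation: the block is at SCN 100 on disk, modified to 110 on instance 1, shipped to instance 2 and modified to 120, then a checkpoint on instance 1 ends with instance 2 writing the current version and the PI being released after a BWR. All structures and names are invented for illustration.

```python
# Sketch replaying the past-image (PI) timeline from the example. Illustrative only.

disk = {"scn": 100}
inst1 = {"current": None, "pi": None, "redo": []}
inst2 = {"current": None, "pi": None, "redo": []}

# Instance 1 reads the block and modifies it: in-memory SCN becomes 110.
inst1["current"] = {"scn": 110}

# Instance 2 wants to modify the block: instance 1 flushes redo, ships the
# SCN=110 copy via Cache Fusion and keeps it only as a past image.
inst1["redo"].append("changes up to scn 110")
inst2["current"] = {"scn": 110}
inst1["pi"], inst1["current"] = {"scn": 110}, None

# Instance 2 modifies the block: SCN becomes 120. Disk is still at SCN 100.
inst2["current"] = {"scn": 120}

# Checkpoint on instance 1: it holds only a PI, so the GRD asks instance 2
# (owner of the current version) to write the block to disk.
disk["scn"] = inst2["current"]["scn"]

# After the write, PI holders are told to release their past images;
# instance 1 records a BWR (block written record) first.
inst1["redo"].append("BWR for block at scn 110")
inst1["pi"] = None

print(disk, inst1, inst2)
```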

Now suppose instance 2 crashes before completing the write. This triggers crash recovery on instance 1 (which is different from single-instance instance recovery). Although every node's modifications are recorded in its own online redo log, instance 1 holds the most recent PI, so recovery can be completed using only instance 1's PI together with instance 2's online log.

So a past image indicates that this instance's SGA holds a version of the block that is inconsistent with the disk, along with its position in the version order; it does not mean the node is still modifying the block. The main purpose of past images is to speed up crash recovery.


4. AST

By now you have seen how a Cache Fusion resource (that is, a data block) is transferred between nodes, but one detail was deliberately left out so as not to distract the reader: how these requests are managed inside the DLM. Let's fill that in now.

The DLM uses two queues to track all lock requests, and two ASTs (asynchronous traps, essentially asynchronous interrupts) to deliver requests and responses. The granted queue records all processes that have been granted a lock, while the convert queue records all processes waiting for a lock.

Suppose process 1 and process 2 both hold the block's lock in S mode, so each has an entry in the granted queue. Now process 2 wants to acquire the lock in X mode. Process 2 must first send a request to the DLM; once the request is submitted, the DLM places process 2 in the convert queue and sends a blocking AST (BAST) to process 1, which holds a lock in an incompatible mode. Because this is an asynchronous request, the DLM does not have to wait for a response. When process 1 receives the BAST, it downgrades its lock to NULL mode, and the DLM converts process 2's lock to X mode.

The DLM then sends an acquisition AST (AAST) to process 2 and moves process 2 to the granted queue, and process 2 can continue its work.
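A sketch of the granted/convert queues and the BAST/AAST handshake just described. The queue handling and callback names are illustrative; real GES/GCS processing is asynchronous and far more involved.

```python
# Sketch of the granted queue / convert queue plus BAST and AAST. Illustrative only.
from collections import deque

granted = {"process 1": "S", "process 2": "S"}  # granted queue: holder -> mode
convert = deque()                               # convert queue: (process, wanted mode)

def bast(process, downgrade_to):
    # Blocking AST: ask an incompatible holder to downgrade; here it complies at once.
    print(f"BAST -> {process}: please downgrade to {downgrade_to}")
    granted[process] = downgrade_to

def aast(process, mode):
    # Acquisition AST: tell the requester its conversion is complete.
    print(f"AAST -> {process}: lock now held in {mode} mode")

def convert_request(process, wanted_mode):
    convert.append((process, wanted_mode))         # requester joins the convert queue
    for holder, mode in list(granted.items()):
        if holder != process and mode != "NULL":   # incompatible with the X request
            bast(holder, "NULL")
    proc, mode = convert.popleft()                 # conversion can now proceed
    granted[proc] = mode                           # back on the granted queue
    aast(proc, mode)

convert_request("process 2", "X")
print(granted)   # {'process 1': 'NULL', 'process 2': 'X'}
```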

5. RAC Concurrency Control Summary

In Cache Fusion, each data block is mapped to a Cache Fusion resource, also called a PCM resource, which is really just a data structure whose name is derived from the block's data block address (DBA). Every request for a block follows these steps: first, the data block address is converted into a PCM resource name; then a request for that PCM resource is submitted to the DLM, which carries out the global lock grant and release. Only after the PCM lock is obtained can the next step proceed, that is, the instance has acquired the right to use the block.

Besides the right to use a block, the state of the block must also be considered. In a single instance, a process that wants to modify a block must do so on the current copy of the block; the same applies in RAC, where an instance that wants to modify a block must obtain the current version of that block. This raises a series of questions: how does a copy of a block travel between cluster nodes, how do we know which node holds the current copy, and how is the transfer carried out? The answer to these questions is the memory fusion technology, Cache Fusion. Once the instance has been granted access and has obtained the correct version, its processes can access the resource using the traditional locks and latches, exactly as in a single instance.
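Putting the pieces together, the request path from this summary can be sketched as three steps: turn the DBA into a PCM resource name, obtain the global PCM lock from the DLM, then fall back to ordinary local locks and latches. All names here are hypothetical, and the resource-name format is only an approximation.

```python
# Sketch of the end-to-end request path from the summary. Illustrative only.
import threading

def pcm_resource_name(file_no, block_no):
    # Step 1: convert the data block address (DBA) into a PCM resource name.
    return f"BL[file={file_no},block={block_no}]"

dlm_grants = {}                 # resource name -> owning node (global, via DLM/GRD)
local_latch = threading.Lock()  # traditional intra-instance protection

def access_block(node, file_no, block_no):
    resource = pcm_resource_name(file_no, block_no)
    # Step 2: obtain the PCM lock from the DLM (grossly simplified).
    dlm_grants[resource] = node
    # Step 3: with the global lock held and the current version of the block
    # obtained via Cache Fusion, ordinary locks/latches work as in single instance.
    with local_latch:
        return f"{node} accesses {resource}"

print(access_block("node1", 4, 123))
```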

Original source: http://blog.csdn.net/cymm_liu/article/details/7899432
