DB2 management: Super availability

I have previously written about DB2 development experience. This article turns to DB2 administration, specifically what I call super availability.

Definition of super availability

I still clearly remember a recent working meeting attended by a senior IT expert from a large company, several DB2 DBAs, and some system programmers. The theme of the meeting was strategic availability, that is, long-term strategy and planning closely tied to the delivery of the company's core business application services. We asked the expert to state the goal around which the strategic availability plan should be built. His answer:

"Never go down, never lose anything"

Those few words fit nicely with the idea of the BHAG (Big Hairy Audacious Goal, pronounced "bee-hag") coined by Jim Collins, author of Built to Last and Good to Great. A BHAG calls for bold, forward-looking thinking. (A friend of mine once explained BHAG thinking this way: "If the goal is set too low, the achievement will match it.")

Let's analyze this availability target.

Never go down

First, it is necessary to understand what "never go down" means (that is, my understanding of it, which matches the definition given by the IT expert mentioned above). It does not mean database or application server uptime as a percentage of elapsed time over some period. Don't misread me: setting and striving for the best possible uptime (for example, 99.9% availability) is useful, especially for those responsible for the application infrastructure (the system administrators). From the business perspective, however, the more important measure of availability is the number of failed customer interactions (FCIs). That is the indicator that really differentiates. Consider the following two scenarios:

Scenario 1: The IT department manager at Company A is enthusiastically celebrating: the availability indicator reached 99.95% over the past month. Unexpectedly, the relationship manager responsible for customer Company B informs the IT department that Company B is unhappy, because a large share of the transactions it sent to Company A last week were not executed properly. What happened? It is not hard to see. Perhaps an infrastructure component (such as an application server) that handles Company B's requests suffered a temporary failure. In a system with hundreds of servers, that is not enough to dent the availability indicator, but it is enough to draw complaints from Company B. Another possibility is that a program logic error, unnoticed by the Company A team responsible for the system, happened to affect the transactions sent by Company B.

Scenario 2: A week later, Julie, a DB2 DBA at Company A, ran into a tricky problem: a DB2 instance on a Linux server went down. The instance is part of an HADR (High Availability Disaster Recovery) configuration (HADR is a DB2 for LUW feature I described in "DB2 Disaster Recovery, Part 2"). The affected instance was processing requests again 20 seconds later, but those 20 seconds did nothing to help this month's availability target. The relationship manager responsible for Company B happened to come by, and Julie braced herself for the inevitable quarrel. Unexpectedly, the relationship manager praised her outstanding performance: Company B's transactions had not failed at all (their transaction timeout value is 30 seconds).
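Scenario 2 works because the client's transaction timeout (30 seconds) is longer than the HADR takeover time (about 20 seconds). Here is a minimal sketch of that client-side behavior, where run_transaction and is_transient_error are hypothetical helpers:

    import time

    TIMEOUT = 30.0  # client transaction timeout, as in scenario 2

    def with_deadline(run_transaction, is_transient_error):
        """Retry across a brief outage: a 20-second HADR takeover
        finishes inside the 30-second budget, so the caller sees a
        slow transaction rather than a failed customer interaction."""
        deadline = time.monotonic() + TIMEOUT
        while True:
            try:
                return run_transaction()
            except Exception as exc:
                if not is_transient_error(exc) or time.monotonic() >= deadline:
                    raise           # the budget is exhausted: a real FCI
                time.sleep(1.0)     # brief pause, then try again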

Therefore, the true meaning of "never go down" is "no FCIs", not "no server downtime". That is undoubtedly good news: some servers can crash briefly without causing a single FCI. It also means that an HADR failover initiated to apply DB2 for LUW maintenance outside the maintenance window does not conflict with the "never go down" availability target (DB2 for z/OS data sharing likewise allows maintenance to be applied outside the maintenance window, as described in my DB2 disaster recovery columns).
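To make the distinction measurable, here is a minimal sketch that computes both numbers from the same data (the interaction log and its fields are hypothetical). It shows how a month at well over 99.9% uptime can coexist with failed interactions, as in scenario 1:

    def availability_metrics(interactions, outage_seconds, period_seconds):
        """interactions: list of (customer, succeeded) pairs."""
        uptime_pct = 100.0 * (period_seconds - outage_seconds) / period_seconds
        fci = sum(1 for _, succeeded in interactions if not succeeded)
        return uptime_pct, fci

    # A 20-second outage barely dents a month's uptime figure ...
    month = 30 * 24 * 3600
    uptime, fci = availability_metrics(
        [("B", True), ("B", False), ("B", False)], 20, month)
    print(f"uptime {uptime:.4f}%  FCIs {fci}")  # uptime 99.9992%  FCIs 2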

The bad news is that the senior IT expert did not say "never go down because of a local fault"; he simply said "never go down". This means the IT department is not off the hook for the application service recovery that must follow when an entire data center is knocked out by a disaster (flood, fire, earthquake, tornado).

There are already plenty of excellent technologies for dramatically reducing the time an application system needs to resume operation at a standby site after a disaster at the primary site; among the best is data replication based on disk arrays. There are also programming practices (such as frequent commits) and DB2 parameter settings (such as checkpoint frequency) that help. Even so, recovering the system after a data center failure without users perceiving any downtime at all, truly seamlessly, seems very difficult to achieve.
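As an aside on the "frequent commits" practice, here is a minimal sketch using ibm_db_dbi, IBM's Python DB-API driver for DB2 (the connection string, table name, and batch size are assumptions):

    import ibm_db_dbi  # IBM's Python DB-API driver for DB2 LUW

    COMMIT_INTERVAL = 500  # rows per unit of work; the value is an assumption

    def load_rows(conn, rows):
        """Load in short units of work. Frequent commits bound the amount
        of log that crash recovery (or an HADR takeover) must replay."""
        cur = conn.cursor()
        for i, row in enumerate(rows, start=1):
            cur.execute("INSERT INTO app.staging (id, payload) VALUES (?, ?)", row)
            if i % COMMIT_INTERVAL == 0:
                conn.commit()  # release locks, keep the active log small
        conn.commit()          # commit the final partial batch

    # Usage (connection string is hypothetical):
    # conn = ibm_db_dbi.connect("DATABASE=SAMPLE;HOSTNAME=db1;PORT=50000;"
    #                           "UID=appuser;PWD=secret;", "", "")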

Think about it: it is already no small feat for an IT department to have applications ready at the standby site within 30 minutes of a disaster at the primary site, yet 30 minutes hardly counts as "never going down". So what do you do? How can system recovery time be compressed to the smallest possible value? Try looking at the problem from another angle.

Never recover

It is obvious once stated: to speed disaster recovery up to the limit, do not recover at all. The fastest recovery is no recovery. How is that possible? Stop thinking in terms of "system recovery" and start thinking in terms of "application service recovery". My favorite strategy is to run a complete instance of the application (including its database) at both site A and site B, with the business traffic split between them. If site A fails, do not try to recover site A's application system at site B. Instead, route the work directed at site A to site B.
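The routing half of that strategy is simple to express. A minimal sketch, where the site endpoints, the health map, and the customer-to-site split are all hypothetical:

    import hashlib

    SITES = {"A": "https://site-a.example.com", "B": "https://site-b.example.com"}

    def home_site(customer_id: str) -> str:
        """Stable split of the customer base across the two sites."""
        digest = hashlib.sha256(customer_id.encode()).digest()
        return "A" if digest[0] % 2 == 0 else "B"

    def route(customer_id: str, healthy: dict) -> str:
        """Send work to the customer's home site; on failure, send it to
        the other complete application instance, with no DB2 recovery step."""
        site = home_site(customer_id)
        if not healthy.get(site, False):
            site = "B" if site == "A" else "A"
        return SITES[site]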

Very simple, right? I know implementing this scheme is not easy. Its success depends on keeping the two complete but geographically separate instances of the application database synchronized (or nearly synchronized) with each other. I personally love disk-array-based data replication, but that (at least by itself) is not a good database synchronization strategy. The main problem with hardware-based replication here is that while the target disk volumes at the remote site are in use for replication, they cannot be used by the remote site's servers. In other words, at site B there are both the "live" database volumes used by the applications and the volumes that receive the replicated database changes made at site A; disk-array-based replication does not cause changes made at site A to be reflected on site B's "live" database volumes. If site A fails, the target replication volumes at site B become usable by site B, but for practical purposes they are usable only after site A's DB2 instance has been restarted and recovered at site B. That is not what we want, because the goal is to restore application service for the clients affected by site A's failure without performing DB2 recovery at site B.

So, if disk-array-based replication is ruled out for database synchronization (even though the technology is very effective for replicating certain non-database files), what technology should we consider? My choice is a software-based replication solution. The most sophisticated part of these tools (available from several vendors, including IBM) is the interface to the DB2 transaction log manager, which identifies database changes made at site A and passes them to site B (the "capture" side of the replication process). At site B, the "apply" side of the replication product uses the DB2 SQL interface to replay the changes made to site A's database against site B's database (changes made at site B flow to site A in the opposite direction).
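A minimal sketch of that capture/apply shape (the change feed and the unit-of-work layout are hypothetical; a real product reads the DB2 recovery log directly rather than a Python queue):

    import queue

    changes = queue.Queue()  # committed units of work captured at site A

    def capture(committed_units_of_work):
        """Capture side (site A): propagate committed work only."""
        for uow in committed_units_of_work:
            changes.put(uow)

    def apply_changes(conn_b):
        """Apply side (site B): replay the changes through plain SQL,
        preserving unit-of-work boundaries."""
        cur = conn_b.cursor()
        while True:
            uow = changes.get()  # e.g. [("UPDATE ... WHERE ...", params), ...]
            for stmt, params in uow:
                cur.execute(stmt, params)
            conn_b.commit()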

This otherwise effective solution still has a flaw: the two-way replication of database changes between site A and site B is inherently asynchronous, because a software-based replication tool must wait for a DB2 COMMIT at one site before passing the relevant database changes to the remote site. This means the two instances of the application database are never precisely synchronized; in practice they are only approximately synchronized. There is a small lag (probably just a few seconds) between the commit of database changes at one site and the application of those changes at the other.

Still, that is already quite good, isn't it? When a data-center-level failure occurs, service can be restored quickly, and only a few seconds' worth of data changes are lost. What more could one hope for? But beyond "never go down" there is the other demanding requirement: "never lose anything".

Stricter Objectives

When the IT expert I mentioned stated the goal of "never go down, never lose anything", I smiled and said to him: "You mean never lose any committed database changes, right? Database changes belonging to transactions still in flight when a failure occurs would be lost, right?" The expert smiled and said: "No. I mean we never lose anything, not even part of a transaction."

Everyone in the room couldn't help saying (or thinking): "That's impossible." But we decided to entertain the possibility. What if we could find a solution that leaves users unaware of a system failure even when the failure strikes while their transactions are still in flight? What would such a solution look like?

An answer came to me: send the inputs of every transaction ("inputs" meaning the requested application service and the parameter values, such as account or order numbers) to both sites, but at site A "play" only the inputs belonging to users routed to site A (inputs for users routed to site B get the same treatment at site B).
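A minimal sketch of that dual delivery (send_to_site, play, record, and home_site_of are hypothetical stand-ins):

    def submit(service, params, send_to_site):
        """Every transaction's inputs go to BOTH sites."""
        msg = {"service": service, "params": params}
        send_to_site("A", msg)
        send_to_site("B", msg)

    def handle(msg, my_site, home_site_of, play, record):
        """Each site plays only the inputs of its own users; the other
        copy is just retained in case a takeover is needed later."""
        if home_site_of(msg["params"]) == my_site:
            play(msg)      # execute the transaction here
        else:
            record(msg)    # keep the input; do not execute it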

Now comes the tricky part of the problem. Boom! Site A is hit by a major failure event. Once the site is confirmed down (within a few seconds, one hopes), the process by which site B takes over the traffic normally routed to site A begins. That process must accomplish the following goals:

1. All transactions flowing to site B are temporarily suspended.

2. Database changes committed at site A but not yet applied at site B are applied at site B (remember that because of the asynchronous nature of DB2 transaction-log-driven data replication tools, the two databases are not exactly synchronized). How? The data replication tools have their own log files, and those files can be synchronized to the standby site through disk-array-based replication (although that is only feasible over fiber connections of roughly 20 to 30 miles between the two sites). Ideally, the replication tool installed at site B could read the copy of site A's tool log files (once those volumes are made available to site B's servers) and use that information to close the gap between the site A and site B application databases. [Hey, replication tool vendors, this may be a market opportunity for you!]

3. Next come the transactions that were still in flight when site A went down. My idea is that every transaction processed at site A is assigned a unique identifier at the start of its execution. If the identifier is a hash value computed from the transaction inputs (which, as described earlier, are sent to both sites), then as long as the two sites use the same hash algorithm, site B has an identifier for every transaction without anything needing to be replicated. When a transaction completes at site A, site A sets a "done" indication for it, and the done/not-done file is also synchronized to site B (probably through array-based replication). At site B, the identifiers of the transactions routed to site A are compared against the done/not-done values replicated from site A. If a transaction received at site A had not completed there, the related transaction inputs (which were sent to both sites) are "played" at site B, and the result is sent back to the submitter (or to the requesting application process). A sketch of this reconciliation appears after this list.

4. Release the suspended transactions (see step 1) for execution, and resume normal operation.
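Here is a minimal sketch of the step 3 reconciliation (the input feed, the "done" set, and play are hypothetical stand-ins for the replicated volumes):

    import hashlib
    import json

    def txn_id(service: str, params: dict) -> str:
        """Hash of the transaction inputs. Both sites receive the same
        inputs and run the same algorithm, so the identifiers match
        without ever being copied between sites."""
        payload = json.dumps([service, params], sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def reconcile(site_a_inputs, done_ids, play):
        """At site B after takeover: replay every site A transaction
        whose 'done' marker is absent, i.e. it was in flight when
        site A failed, and return the replayed identifiers."""
        replayed = []
        for service, params in site_a_inputs:
            tid = txn_id(service, params)
            if tid not in done_ids:
                play(service, params)  # execute at site B, return the result
                replayed.append(tid)
        return replayed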

Easy, isn't it? Just kidding. I know this solution is quite complicated, and it may take some user programming to make it work. But I believe a "never lose anything" solution of this kind (with the "never go down" characteristic as well) can be designed and implemented (in fact, a similar scheme forms the foundation of the never-go-down/never-lose-anything solution at the expert's company). Keep pushing ahead; super availability can be achieved.

Finally, do not run away from your company's IT-related BHAGs. If your company doesn't have any, push for them. In any case, be sure to get involved. They will push you to realize your full potential and to do work you can be proud of. I believe tomorrow will be brilliant.
 
