Unexpected Outage of a Cloud Platform Database Host

Source: Internet
Author: User

Problem Introduction:

Many companies run their own private cloud and choose to divide their hosts into host collections, like this:

[Image: Host collection.png]

That is a good start, but the essence of creating host collections is differentiated treatment: each zone contains physical nodes with a different hardware configuration.

For example:

1. zone1 is used to create CPU-intensive cloud hosts

2. zone2 is used to create high-memory cloud hosts

3. zone3 is used to create cloud hosts with high disk I/O requirements

If the hosts are not treated differently, there is no point in dividing them into host collections.


The following is a case from our company:

One: The problem: the production DB master node suddenly went down at noon on the 19th, causing a business interruption.

Two: Problem solving:

In production, the guiding principle is to restore service as quickly as possible, so there was no time to investigate the cause first.

1. Starting the cloud host failed from both Horizon and the CLI. The host status was error, and in this state the only operation available is delete.

2. Detaching the cloud disk from the cloud host failed, because a disk cannot be detached while the host is in this state. We therefore deleted the original cloud host, after which the cloud disk returned to the available state.

3. In the CLI, load the admin environment variables and run nova reset-state --active backend-mysql-01. After the state is reset, take a snapshot, then run the following command (this is not a rebuild; it only creates a new host from the snapshot with a specified fixed IP):

nova boot prod-zabbix02-mysql01 --flavor c1.medium --image 7388c74b-bf8f-4b64-911e-40f838840602 --security_group Zabbix-sec-group --nic net-id=bcb3cef7-da93-450c-9f2f-83279d24e9a4,v4-fixed-ip=172.30.0.21

4. Re-attach the cloud disk

5. The DB group started the service successfully
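
Long option lists like the boot command in step 3 are easy to mistype under pressure, so it can help to assemble them in a script. The following is only a sketch: the helper name build_boot_command is hypothetical, it does not call any OpenStack API, and the flags and values are simply those of the command shown above.

```python
import shlex


def build_boot_command(name, flavor, image_id, security_group, net_id, fixed_ip):
    """Return the argument list for the `nova boot` call used in step 3.

    Hypothetical helper: it only assembles the options that appear in the
    article's command and does not contact any OpenStack service.
    """
    return [
        "nova", "boot", name,
        "--flavor", flavor,
        "--image", image_id,
        "--security_group", security_group,
        "--nic", f"net-id={net_id},v4-fixed-ip={fixed_ip}",
    ]


cmd = build_boot_command(
    name="prod-zabbix02-mysql01",
    flavor="c1.medium",
    image_id="7388c74b-bf8f-4b64-911e-40f838840602",
    security_group="Zabbix-sec-group",
    net_id="bcb3cef7-da93-450c-9f2f-83279d24e9a4",
    fixed_ip="172.30.0.21",
)

# shlex.join (Python 3.8+) quotes arguments where needed before printing.
print(shlex.join(cmd))
```

Keeping the command as a list of arguments means each value stays a single argument, and the list can be passed to subprocess.run unchanged if the script is later extended to execute it.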

Three: An e-mail arrived out of the blue (the following is from the database director; by the time it was sent, the failure had already been resolved):

Production DB servers have recently gone down several times in a row. /var/log/messages on the servers contains no relevant information, and monitoring shows that the database load was very low during the failure periods. I hope the cloud computing department can analyze the cause at the virtual machine level.


September 19, around 20:30: sudden outage of the Zabbix system database, 172.30.0.21 prod-zabbix02-mysql01

September 19, around 12:00: sudden outage of the back-end management project database, 172.40.0.34 backend-mysql-01 mysql node

September 3, around 21:20: sudden outage of the back-end management project database, 172.40.0.36 backend-mysql-03 mysql node

In addition, after earlier failures the virtual machines could generally be restarted quickly, but on these occasions they could not be started at all, which made recovery take longer.



Four: Forced analysis of the cause (in fact I refused at first and was not happy about it, until the vice president also sent a "congratulatory" message)

The cause of yesterday's backend-mysql-01 incident in the PROD environment is analyzed below.

    

backend-mysql-01 runs on the compute03 node. The compute03 error log:

[Image: Compute03.png]

    

This problem occurs because too much memory has been allocated to the VMs (in total even more than the physical host's memory). backend-mysql-01 uses the m2.xlarge flavor, which has 32 GB of memory.

[Image: vm.png]

Meanwhile, compute03 has only 10 GB of physical memory remaining, so once the database load pushes memory usage high enough, the host goes down. A restart then fails because compute03 does not have enough physical memory.

One additional point:

When a cloud host is created, memory is overcommitted: ram_allocation_ratio=1.5 (this is the default configuration)

This allows a new cloud host with 15 GB of memory to be created on a physical node that has only 10 GB of memory remaining.

This is an optimization strategy when resources are sufficient (no host actually uses 100% of its memory, so overcommitting memory means we can create more cloud hosts). For applications with very high memory requirements such as a DB, however, this configuration is a fuse waiting to blow (memory exhaustion): the cloud host goes down and cannot be restarted.
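
The gap between what the scheduler admits and what the hardware can actually back can be made concrete with a little arithmetic. The sketch below assumes the scheduler takes total memory multiplied by ram_allocation_ratio as its limit and subtracts what is already allocated; that is my understanding of Nova's RAM filtering, not something stated in this case. The 64 GB node size is hypothetical (compute03's total memory is not given here); the 10 GB remainder and the 32 GB flavor are from the case above.

```python
def schedulable_ram_gb(total_gb, allocated_gb, ram_allocation_ratio):
    """Memory the scheduler still considers available on a node (assumed formula)."""
    return total_gb * ram_allocation_ratio - allocated_gb


# Hypothetical node: 64 GB installed, 54 GB already allocated -> 10 GB remaining.
TOTAL_GB = 64
ALLOCATED_GB = 54
FLAVOR_GB = 32  # memory of the flavor used by backend-mysql-01

remaining_gb = TOTAL_GB - ALLOCATED_GB
virtual_gb = schedulable_ram_gb(TOTAL_GB, ALLOCATED_GB, 1.5)

# With overcommit, the scheduler accepts the 32 GB host ...
scheduler_accepts = virtual_gb >= FLAVOR_GB
# ... even though the remaining physical memory cannot back it in full.
physically_backed = remaining_gb >= FLAVOR_GB

print(f"remaining={remaining_gb} GB, schedulable={virtual_gb} GB")
print(f"scheduler accepts: {scheduler_accepts}, physically backed: {physically_backed}")
```

Under the same assumptions, a ratio of 1.0 leaves only the 10 GB remainder schedulable, so the 32 GB host would be rejected at creation time instead of failing later in production.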

Five: Three different solutions

Solution One:

Add new compute nodes (high-memory configuration) and place them in a separate host collection for the DB department, with the overcommit ratio set to ram_allocation_ratio=1.0
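
To see what the ratio means for a dedicated DB node, the following sketch counts how many 32 GB cloud hosts a node accepts at each ratio. The 128 GB node size is a hypothetical figure, and the formula is the simplified "total memory times ratio" limit, not a description of the scheduler's full behaviour.

```python
import math


def max_hosts(node_ram_gb, flavor_ram_gb, ram_allocation_ratio):
    """Number of cloud hosts of one flavor a node accepts under a given ratio."""
    return math.floor(node_ram_gb * ram_allocation_ratio / flavor_ram_gb)


NODE_RAM_GB = 128   # hypothetical high-memory compute node
FLAVOR_RAM_GB = 32  # memory of the DB flavor in the case above

for ratio in (1.5, 1.0):
    hosts = max_hosts(NODE_RAM_GB, FLAVOR_RAM_GB, ratio)
    promised = hosts * FLAVOR_RAM_GB
    print(f"ratio={ratio}: {hosts} hosts, {promised} GB promised on a {NODE_RAM_GB} GB node")
```

At 1.0 the memory promised to cloud hosts never exceeds what is installed, which is the property a DB host collection needs.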

Solution Two:

Upgrade the memory of the compute nodes

Solution Three:

1. Take stock of the resources in the production environment and pick out the hosts that have sufficient resources

2. Add a host collection that contains those well-resourced hosts, then create new cloud hosts in that host collection
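
The screening in step 1 can be sketched as a simple filter over an inventory. Everything here is illustrative: the ComputeNode class and the hosts_with_headroom helper are hypothetical, and so are the node sizes; only compute03's 10 GB remainder comes from the case above. In practice the figures would come from the cloud platform's own hypervisor statistics.

```python
from dataclasses import dataclass


@dataclass
class ComputeNode:
    name: str
    total_ram_gb: int
    allocated_ram_gb: int

    @property
    def free_ram_gb(self):
        """Physical memory not yet allocated to any cloud host."""
        return self.total_ram_gb - self.allocated_ram_gb


def hosts_with_headroom(nodes, required_gb):
    """Names of nodes whose unallocated physical memory covers required_gb."""
    return [n.name for n in nodes if n.free_ram_gb >= required_gb]


# Hypothetical inventory; only compute03's 10 GB remainder is from the article.
inventory = [
    ComputeNode("compute01", 128, 40),
    ComputeNode("compute02", 128, 110),
    ComputeNode("compute03", 64, 54),
]

# Nodes that could hold a 32 GB DB host without relying on overcommit.
candidates = hosts_with_headroom(inventory, required_gb=32)
print(candidates)
```

The filter compares against unallocated physical memory, not the overcommitted limit, because the point of the new host collection is that DB hosts get memory that really exists.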


Comparison of the three solutions (based on the principle that DB applications are separated from other applications and run on their own physical nodes):

Solution One: the best. No node needs to be stopped. The particular performance requirements of database applications mean they should run on different physical nodes from other applications, so we need to allocate high-performance physical machines to them separately.

Solution Two: acceptable. The nodes must be stopped to upgrade their memory, but it does solve the problem.

Solution Three: the worst. It solves the short-term problem but still does not separate DB applications from other applications.



This article is from the "A Good person" blog. Please be sure to keep this source: http://egon09.blog.51cto.com/9161406/1854592
