1, one of the observers hangs, database exception
OceanBase uses a multi-replica cluster model: the servers sit in different zones, and the zones back each other up. The upper layer distributes traffic through SLB, so what happens to the business when one of the servers goes down?
Fault injection: with 50 users concurrently running a query transaction, kill an observer (first the observer where the business resides, then one where it does not).
Log on to the OB server where the business resides and execute the following command:
ps -ef | grep observer; kill -9 <PID>
Expected impact: TPS first drops to 0, returns to normal within 1 minute, and some transactions fail.
Monitoring found: TPS first dropped to 0 and returned to normal after 40 seconds; 238 transactions failed.
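A slightly safer variant of the kill step above, assuming the process name is exactly observer, avoids matching the grep process itself:
pgrep -x observer               # list only the observer PIDs
kill -9 $(pgrep -x observer)    # forcibly terminate the observer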
Log on to an observer where the business does not reside and execute the following command:
ps -ef | grep observer; kill -9 <PID>
Expected impact: No impact on the system.
Monitoring found: transaction TPS and response time showed no significant change.
Recovery method: check the observer processes, find the one that is down, and start it as the admin user. Execute the following commands:
su - admin
cd /home/admin/oceanbase && ./bin/observer
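To confirm the restarted observer has rejoined the cluster, a hedged check (assuming the sys tenant is reachable through a surviving observer on port 2881 with the MySQL client, and that the internal table oceanbase.__all_server exists in your OceanBase version):
ps -ef | grep observer | grep -v grep        # confirm the process is running again
mysql -h<any_observer_ip> -P2881 -uroot@sys -e "SELECT svr_ip, svr_port, status FROM oceanbase.__all_server;"   # every server should report an active status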
2, server-level CPU saturation, database exception
Fault injection: saturate the CPU of the server hosting the observer where the business resides.
Log on to that server and run a script with the following contents:
#!/bin/bash
# filename killcpu.sh
endless_loop()
{
    echo -ne "i=0;
    while true
    do
        i=i+100;
        i=100
    done" | /bin/bash &
}

if [ $# != 1 ]; then
    echo "USAGE: $0 <CPUs>"
    exit 1;
fi

for i in `seq $1`
do
    endless_loop
    pid_array[$i]=$!;
done

for i in "${pid_array[@]}"; do
    echo 'kill ' $i ';';
done

# Run: ./killcpu.sh <N>
# The argument is the number of CPU cores to occupy
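As a usage sketch (the core count 4 below is only an example), a drill run might look like this; the script prints one kill command per busy loop it started, and running those printed commands is how the load is released:
chmod +x killcpu.sh
./killcpu.sh 4        # occupy 4 CPU cores; the script then prints one "kill <PID> ;" line per loop
top                   # press 1 to watch per-core utilization while the drill runs
# to stop the drill, execute the kill commands that killcpu.sh printed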
Expected impact: TPS drops and server CPU utilization is high.
Monitoring found: TPS dropped from 700 to 550, OB server CPU utilization reached 90%, and things then stayed stable.
Recovery method: use Linux commands to find the processes currently using the most CPU on the database server, decide whether they are important, and kill the ones that are not.
ps aux | sort -k3nr | head -n 10     # %CPU is the third column of ps aux output
3, server-level disk full, database exception drill
Fault injection: fill up the disk of the observer server.
Log on to the observer server and execute the command:
dd if=/dev/zero of=/home/admin/oceanbase/log/1.log bs=100k count=1600000
Expected impact: TPS drops to 0 and transactions keep reporting errors.
Monitoring found: TPS dropped from 700 to 0; once the disk was full, transactions kept failing, 3014 failures in total, and transaction performance fluctuated.
Recovery method: the impact here mainly comes from two places. If only OB and OB-related components are installed on the machine, a full disk means either data files or log files. If it is the data files, there is nothing to do but add resources; if it is the log files, locate the corresponding directory and delete the redundant log files.
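A minimal command sketch for the log-file case described above; the paths reuse the /home/admin/oceanbase directory already used in this drill:
df -h                                      # confirm which filesystem is full
du -sh /home/admin/oceanbase/log/*         # find the largest files under the OB log directory
rm -f /home/admin/oceanbase/log/1.log      # remove the file created by the dd command above
ls -lhS /home/admin/oceanbase/log | head   # review the remaining large log files before deleting any of them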
4, a large amount of bad SQL in the database, database exception
Fault injection: while a normal business workload is running, concurrently launch a bad-SQL workload.
Expected impact: TPS of the normal business drops and failures may appear.
Monitoring found: after launching 1000 concurrent batch database operations, transaction TPS immediately dropped from 700 to 150, and the batch queries failed with timeouts.
Recovery method: bad SQL is a common database problem. Use show processlist; to see the SQL the database is currently executing, find the statements that have been running for a long time, and optimize them. You can also use OceanBase's own views gv$sql and gv$sql_audit to see what the database has executed and then optimize the slow SQL.
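For illustration only (column names in gv$sql_audit differ between OceanBase versions, so sql_id, elapsed_time and query_sql here are assumptions to check against your version), the long-running statements could be pulled out like this:
SHOW PROCESSLIST;                          -- what is executing right now and for how long
SELECT sql_id, elapsed_time, query_sql
  FROM oceanbase.gv$sql_audit
 ORDER BY elapsed_time DESC
 LIMIT 10;                                 -- the 10 slowest executions recorded by OB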
5, a sudden increase in business plus observer scale-up, database exception
Fault injection: the business suddenly grows by 50 users. Afterwards, scale up the tenant's CPU: alter resource pool xxx_poll unit c12_unit;
Expected impact: the extra concurrency pushes TPS and response time up; scaling the resources causes brief jitter, after which TPS rises and RT falls.
Monitoring found: after adding users, TPS rose from 700 to 800 and transaction response time went from 0.070 s to 0.095 s. The scale-up caused a little jitter, then TPS recovered to 800 and RT to 0.095 s.
Recovery method: none.
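For reference, the scale-up relies on the larger unit config already existing. A hedged sketch run in the sys tenant (assuming the internal table oceanbase.__all_unit_config is available in your OceanBase version; only the names xxx_poll and c12_unit come from the drill itself):
SELECT name, max_cpu, min_cpu FROM oceanbase.__all_unit_config;   -- confirm that the c12_unit config exists and check its CPU spec
ALTER RESOURCE POOL xxx_poll UNIT 'c12_unit';                     -- point the tenant's resource pool at the larger unit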
6, simulate a database server network failure
Fault injection: block the observer's 2882 and 2881 ports.
Execute the commands:
iptables -A INPUT -i bond0 -p tcp --dport 2882 -j DROP
iptables -A INPUT -i bond0 -p tcp --dport 2881 -j DROP
iptables -A OUTPUT -p tcp --dport 2882 -j DROP
iptables -A OUTPUT -p tcp --dport 2881 -j DROP
Expected impact: TPS drops first, then returns to normal, with some transaction errors.
Monitoring found: TPS first dropped by half and returned to normal after 1 minute; transaction errors continued for 4 minutes.
Recovery method:
iptables -D INPUT 1
iptables -D INPUT 1
iptables -D OUTPUT 1
iptables -D OUTPUT 1
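Deleting by rule number only works as written if the DROP rules added above still sit at the top of each chain, and -D INPUT 1 (and -D OUTPUT 1) is run twice because the remaining rules shift up after each deletion. Listing the chains with line numbers first makes this explicit:
iptables -nL INPUT --line-numbers     # check which positions the two DROP rules occupy before deleting by number
iptables -nL OUTPUT --line-numbers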