MySQL's weird sync problem: a repeated relay-log entry

Source: Internet
Author: User

The weird sync problem with MySQL

I recently ran into a strange MySQL replication problem. After repeated analysis, it turned out, incredibly, to be caused by our backups. The case is odd enough that the whole analysis and diagnosis process is recorded here.

Phenomenon

A colleague scaled out a new slave, and it would not sync. The specific symptom: Seconds_Behind_Master (SBM) was 0, but Exec_Master_Log_Pos never matched Read_Master_Log_Pos, and the server's own CPU and I/O were being consumed heavily.

    ----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
    usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
      1   0  ..   0   0   0|8667B  862k|   0     0 |   0     0 | 536   842
      1   3  94   2   0   1|   0   102M| 112k   12k|   0     0 | 21k   50k
      1   3  94   2   0   1|   0   101M|  89k 5068B|   0     0 | 21k   49k
      1   3  94   2   0   1|   0   103M|  97k 5170B|   0     0 | 21k   50k
      1   4  94   2   0   1|   0   103M|  80k   15k|   0     0 | 21k   50k
      1   4  94   2   0   1|   0   102M| 112k 8408B|   0     0 | 21k   50k

Analysis

Because SBM alone does not truly reflect the state of replication, our first theory was that some very large event was being executed, which would explain SBM=0 while the positions disagreed. But after watching for a while the symptom never changed, and Relay_Log_File stayed on the same file the whole time.
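The check being made here can be sketched as a small helper. This is an illustrative sketch of my own, not anything from the original incident; the field names are the standard SHOW SLAVE STATUS columns, and treating a single snapshot as "stuck" is only a heuristic, since the two positions can legitimately differ for a moment on a busy slave:

```python
def replication_stuck(status):
    """SBM == 0 alone does not prove the slave has caught up; also compare
    how far the I/O thread has read (Read_Master_Log_Pos) against how far
    the SQL thread has executed (Exec_Master_Log_Pos).
    `status` is a dict of SHOW SLAVE STATUS columns."""
    return (status["Seconds_Behind_Master"] == 0 and
            (status["Master_Log_File"] != status["Relay_Master_Log_File"] or
             status["Read_Master_Log_Pos"] != status["Exec_Master_Log_Pos"]))

# The incident's symptom: SBM is 0, yet the executed position lags the read one.
print(replication_stuck({
    "Seconds_Behind_Master": 0,
    "Master_Log_File": "mysql-bin.000725",
    "Relay_Master_Log_File": "mysql-bin.000724",
    "Read_Master_Log_Pos": 4096,
    "Exec_Master_Log_Pos": 107,
}))  # → True
```

In practice you would sample this several times, as the team did, before concluding the slave is stuck.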

To find the cause of this phenomenon, we turned to strace. Since the dominant symptom on the server was a huge volume of disk writes, we first used iotop to locate the thread IDs doing the writing, and then attached to them with strace -p.

The information collected through strace was as follows:

    read(35, "\376bin"..., 8192)                 = ...
    lseek(38, 0, SEEK_SET)                       = 0
    write(38, "./relay-bin.000914\n107\nmysql-bin"..., ...) = ...
    read(35, "", 8042)                           = 0
    close(35)                                    = 0
    lseek(38, 0, SEEK_SET)                       = 0
    write(38, "./relay-bin.000914\n107\nmysql-bin"..., ...) = ...
    unlink("./relay-bin.rec")                    = -1 ENOENT (No such file or directory)
    open("./relay-bin.rec", O_RDWR|O_CREAT, 0660) = ...
    lseek(.., 0, SEEK_CUR)                       = 0
    fsync(..)                                    = 0
    lseek(36, 0, SEEK_SET)                       = 0
    read(36, "./relay-bin.000914\n./relay-bin.0"..., 8192) = 437
    lseek(36, 0, SEEK_SET)                       = 0
    write(36, "./relay-bin.000914\n./relay-bin.0"..., 437) = 437
    lseek(36, 437, SEEK_SET)                     = 437
    read(36, "", 8192)                           = 0
    lseek(36, 0, SEEK_END)                       = 437
    fsync(36)                                    = 0
    lseek(.., 0, SEEK_SET)                       = 0
    close(..)                                    = 0
    unlink("./relay-bin.rec")                    = 0
    lseek(36, 0, SEEK_SET)                       = 0
    read(36, "./relay-bin.000914\n./relay-bin.0"..., 437) = 437
    open("./relay-bin.000914", O_RDONLY)         = 35
    lseek(35, 0, SEEK_CUR)                       = 0
    read(35, "\376bin"..., 8192)                 = ...
    lseek(38, 0, SEEK_SET)                       = 0
    write(38, "./relay-bin.000914\n107\nmysql-bin"..., ...) = ...
    read(35, "", 8042)                           = 0

From the output above we can see mysqld constantly reading and writing file descriptors 35, 36 and 38. From the contents being read and written, we can guess that these three descriptors are probably relay-bin.index, relay-bin.000914 and relay-log.info. But how do we confirm which is which? This is where lsof comes in: every PID has an fd directory under /proc, and inside it is a symlink for each open file, so the fd number maps directly to a concrete file.
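The /proc/&lt;pid&gt;/fd lookup can also be scripted. Here is a minimal Python sketch of my own (Linux-only; the function name fd_map is hypothetical) that recovers the same fd-to-file mapping that lsof showed:

```python
import os

def fd_map(pid):
    """Return {fd: target} for a process by reading the /proc/<pid>/fd
    symlinks (Linux only) -- the same information lsof displays."""
    fd_dir = "/proc/%d/fd" % pid
    mapping = {}
    for name in os.listdir(fd_dir):
        try:
            mapping[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # the fd was closed between listdir() and readlink()
    return mapping

# Demo on our own process: an opened file shows up under its fd number.
with open("/tmp/fd_map_demo.txt", "w") as f:
    print(fd_map(os.getpid())[f.fileno()])  # → /tmp/fd_map_demo.txt
```

Running this against the mysqld PID would directly answer which of 35, 36 and 38 is the index, the relay log, and the info file.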

    cd /proc/$pid/fd
    ls -l
    lr-x------ 1 root root .. Jan .. ..:..  0 -> /dev/null
    l-wx------ 1 root root .. Jan .. ..:..  1 -> /data1/mysql3361/error.log
    lrwx------ 1 root root .. Jan .. ..:.. 10 -> /data1/mysql3361/ib_logfile1
    lrwx------ 1 root root .. Jan .. ..:.. 11 -> /data1/mysql3361/ib_logfile2
    lr-x------ 1 root root .. Jan .. ..:.. 12 -> /data1/mysql3361/relay-bin.rec
    lrwx------ 1 root root .. Jan .. ..:.. .. -> /data1/mysql3361/ibf0ovt9 (deleted)
    lrwx------ 1 root root .. Jan .. ..:.. .. -> socket:[1042425539]
    lrwx------ 1 root root .. Jan .. ..:.. .. -> /data1/mysql3361/relay-bin.index
    lrwx------ 1 root root .. Jan .. ..:.. .. -> socket:[1042425540]
    lrwx------ 1 root root .. Jan .. ..:.. .. -> /data1/mysql3361/sngdb/sngrankscore6.ibd
    lrwx------ 1 root root .. Jan .. ..:.. .. -> /data1/mysql3361/mysql/host.MYI
    lrwx------ 1 root root .. Jan .. ..:.. .. -> /data1/mysql3361/mysql/host.MYD
    l-wx------ 1 root root .. Jan .. ..:..  2 -> /data1/mysql3361/error.log
    lrwx------ 1 root root .. Jan .. ..:.. .. -> /data1/mysql3361/mysql/user.MYI
    lrwx------ 1 root root .. Jan .. ..:.. .. -> /data1/mysql3361/mysql/user.MYD

Positioning

At this point we were quite sure the problem was tied to these three files. So what exactly was wrong with them? Checking each in turn, relay-bin.index turned out to be very unusual: the same relay log appeared on two lines of the index. Could this be the cause of the problem?

    cat relay-bin.index
    ./relay-bin.000914
    ./relay-bin.000914
    ./relay-bin.000915
    ./relay-bin.000916

    cat relay-log.info
    ./relay-bin.000914
    107
    mysql-bin.000724
    107

After removing one of the lines by hand, everything went back to normal. So the cause of the symptom was the duplicated line in relay-bin.index. But what produced the duplicate in the first place?

Re-analysis

At this point pure reasoning was of no further use; we had to look for the answer in the code, mainly in slave.cc and log.cc.

Specifically, when MySQL starts replication, it reads the relay log file name and position from relay-log.info and begins executing relay log events; each relay log is deleted once it has been fully executed. To move on to the next relay log, MySQL uses reinit_io_cache() to reset the file pointer of relay-bin.index back to the beginning, and then find_log_pos() memcmp()s line by line from the first line of the index to locate the offset of the current relay log. Because the duplicate entry means the scan always matches the first occurrence, the "next" relay log turns out to be the very same file, and an endless loop is produced.

    int MYSQL_BIN_LOG::find_log_pos(LOG_INFO *linfo, const char *log_name,
                                    bool need_lock)
    {
      ...
      (void) reinit_io_cache(&index_file, READ_CACHE, (my_off_t) 0, 0, 0);

      for (;;)
      {
        ...
        if (!log_name ||
            (log_name_len == fname_len - 1 && full_fname[log_name_len] == '\n' &&
             !memcmp(full_fname, full_log_name, log_name_len)))
        {
          DBUG_PRINT("info", ("Found log file entry"));
          full_fname[fname_len - 1]= 0;           // remove last \n
          linfo->index_file_start_offset= offset;
          linfo->index_file_offset= my_b_tell(&index_file);
          break;
        }
      }
      ...
    }


In simple terms:

    1. The SQL thread reads relay-log.info and starts executing from relay-bin.000914.
    2. When it finishes 000914, it goes to the index file to look up the next relay log.
    3. Because the file pointer is reset to the first line of the index, the scan matches the first of the two duplicate 000914 entries.
    4. The "next" log after that entry is the duplicate 000914 line itself, and so the endless loop is created.
    5. The 000915 and later files are generated by the I/O thread, which is independent of the SQL thread and can be ignored here.
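The steps above can be reproduced with a toy model of the index scan. This is a simplified sketch of my own in Python, not MySQL's actual code; the function names merely mirror find_log_pos and the next-log lookup:

```python
def find_log_pos(index_lines, log_name):
    """Mimics MYSQL_BIN_LOG::find_log_pos: scan the index from the FIRST
    line and return the position of the first entry matching log_name."""
    for i, line in enumerate(index_lines):
        if line == log_name:
            return i
    return -1

def next_relay_log(index_lines, current):
    """The 'next' relay log is the entry after the matched one."""
    i = find_log_pos(index_lines, current)
    if i < 0 or i + 1 >= len(index_lines):
        return None
    return index_lines[i + 1]

index = ["./relay-bin.000914",   # duplicate row from the bad backup
         "./relay-bin.000914",
         "./relay-bin.000915",
         "./relay-bin.000916"]

current = "./relay-bin.000914"
for _ in range(3):
    current = next_relay_log(index, current)
    print(current)
# prints ./relay-bin.000914 three times: the SQL thread never advances
```

Because the scan always restarts from the first line, the entry after the first match is the duplicate itself, so the "next" log never changes.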

At this point, the cause of the SBM=0 symptom and of the huge I/O consumption was confirmed: a duplicate row in the relay-bin.index file. But why would the same row appear twice?

Tracing the root cause

First of all, nobody would ever produce this by idly editing the file by hand in vi.

Second, when you manually run FLUSH LOGS, a new relay log is created and a line for it is appended to the index.

Third, MySQL automatically performs an operation similar to FLUSH LOGS when it restarts, which likewise creates a new relay log and appends it to the index.

Combining the above information with our colleagues' persistent tracing, the final discovery was that our backups were responsible.

First, we take a cold backup of the database every day, using rsync.

Second, we archive the logs every day by running FLUSH LOGS on a schedule.

When the two jobs happen to overlap, a FLUSH LOGS can run after rsync has copied the relay logs but before it has copied relay-bin.index. The relay logs and relay-bin.index in the backup are then inconsistent with each other, and when that backup is used to bring up a new slave, the index file ends up with a duplicate row.

Summary and improvements

In the end, such a strange, low-probability problem finally had its root cause found; beforehand, you could rack your brains forever and never work out what was going on. Thanks to @zolker, @Big free live and @zhangyang8 for the in-depth investigation and tracing; without their perseverance we would never have traced it this far.

And the improvements:

    1. Strengthen the consistency of the files taken in the backup, fixing the problem at its root.
    2. Make the restore procedure more defensive: the index file is meaningless for a slave built from a backup, so if duplicate rows are found, the index file can simply be emptied.
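Improvement 2 can be sketched in a few lines. This is an illustrative Python sketch of my own (clean_index is a hypothetical helper, not part of any restore tool mentioned above):

```python
def clean_index(lines):
    """If the index contains any duplicated relay-log entry, the whole
    file is suspect (it came from an inconsistent backup), so follow the
    article's improvement #2 and return an empty index; otherwise keep it."""
    seen = set()
    for line in lines:
        if line in seen:
            return []
        seen.add(line)
    return lines

bad  = ["./relay-bin.000914", "./relay-bin.000914", "./relay-bin.000915"]
good = ["./relay-bin.000914", "./relay-bin.000915"]
print(clean_index(bad))   # → []
print(clean_index(good))  # → ['./relay-bin.000914', './relay-bin.000915']
```

Emptying the index is safe here only because, for a freshly built slave, the relay logs will be re-fetched from the master anyway.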

