A record of troubleshooting file copies that hang on GlusterFS

Last Update:2017-11-04 Source: Internet

Author: User

Tags glusterfs gluster


This article was written some time ago; the information in it may have evolved or changed.

# Introduction

We have a distributed service that uses GlusterFS for storage and reads and writes a large number of files. In the company's development and test environments everything works normally, but in the production and pre-production (staging) environments, copying a file to GlusterFS often hangs. The files are 200 MB to 5 GB in size, and the hang occurs with a probability of about 3% to 4%. By the time the copy hangs, the file has been copied completely (the MD5 checksums of the source and target files match); the copy is stuck closing the target file handle.

```go
func CopyFile(src, dest string) (copiedSize int64, err error) {
	copiedSize = 0
	srcFile, err := os.Open(src)
	if err != nil {
		return copiedSize, err
	}
	defer srcFile.Close()

	destFile, err := os.Create(dest)
	if err != nil {
		return copiedSize, err
	}
	defer destFile.Close() // hangs here

	return io.Copy(destFile, srcFile)
}
```

Example of a stuck goroutine:

```
goroutine 109667 [syscall, 711 minutes]:
syscall.Syscall(0x3, 0xf, 0x0, 0x0, 0xafb1a0, 0xc42000c150, 0x0)
	/usr/local/go/src/syscall/asm_linux_amd64.s:18 +0x5
syscall.Close(0xf, 0x0, 0x0)
	/usr/local/go/src/syscall/zsyscall_linux_amd64.go:296 +0x4a
os.(*file).close(0xc420344f00, 0x455550, 0xc4203696d0)
	/usr/local/go/src/os/file_unix.go:140 +0x86
os.(*File).Close(0xc4200289f0, 0x1b6, 0xc4200289f0)
	/usr/local/go/src/os/file_unix.go:132 +0x33
common/utils.CopyFile(0xc42031eea0, 0x5d, 0xc420314840, 0x36, 0x10ce9d94, 0x0, 0x0)
...
```

The code around line 18 of /usr/local/go/src/syscall/asm_linux_amd64.s, the last frame, is as follows:

```
TEXT	·Syscall(SB),NOSPLIT,$0-56
	CALL	runtime·entersyscall(SB)	// stuck here, at the start of the system call
	MOVQ	a1+8(FP), DI
	MOVQ	a2+16(FP), SI
	MOVQ	a3+24(FP), DX
	MOVQ	$0, R10
	MOVQ	$0, R8
	MOVQ	$0, R9
	MOVQ	trap+0(FP), AX	// syscall entry
	SYSCALL
	CMPQ	AX, $0xfffffffffffff001
	JLS	ok
	MOVQ	$-1, r1+32(FP)
	MOVQ	$0, r2+40(FP)
	NEGQ	AX
	MOVQ	AX, err+48(FP)
	CALL	runtime·exitsyscall(SB)
	RET
ok:
	MOVQ	AX, r1+32(FP)
	MOVQ	DX, r2+40(FP)
	MOVQ	$0, err+48(FP)
	CALL	runtime·exitsyscall(SB)
	RET
```

# Solution process

The development and test environments run GlusterFS 3.3.2, while the production and pre-production environments run 3.7.6, so our first guess was that the version difference caused the problem. We began by checking whether the GlusterFS version was at fault and whether anything was wrong with the hardware or software of the GlusterFS deployment, but we never found the real cause.

At this point company process became a hindrance. Because our hypotheses were based only on experience, even a small code change had to be submitted for testing and signed off by several managers; in the worst case, a single pass through the process was not finished within a day. In the end, out of options, I applied to my manager for permission to operate on part of the pre-production environment. Finally I could get to work.

## First attempt: a timeout mechanism

The first thing we thought of was a timeout mechanism. Following ./tensorflow/core/platform/cloud/retrying_utils.cc in the TensorFlow source, we implemented something similar in Go using reflect. The basic principle is that calls that may hang, such as Close, are run in a separate goroutine; if Close hangs for longer than a given timeout, the caller simply carries on, since the file has already been copied. The main code is as follows:

```go
type RetryingUtils struct {
	timeout    time.Duration
	maxRetries int
}

type CallReturn struct {
	Error        error
	ReturnValues []reflect.Value
}

// ErrTimeout is returned when the call does not finish within the timeout.
var ErrTimeout = errors.New("call timed out")

func NewRetryingUtils(timeout time.Duration, maxRetries int) *RetryingUtils {
	return &RetryingUtils{timeout: timeout, maxRetries: maxRetries}
}

func (r *RetryingUtils) CallWithRetries(any interface{}, args ...interface{}) CallReturn {
	var callReturn CallReturn
	var retries int
	for {
		callReturn.Error = nil
		// Buffered so the goroutine can finish even after we stop waiting.
		done := make(chan []reflect.Value, 1)
		go func() {
			function := reflect.ValueOf(any)
			inputs := make([]reflect.Value, len(args))
			for i := range args {
				inputs[i] = reflect.ValueOf(args[i])
			}
			done <- function.Call(inputs)
		}()
		select {
		case values := <-done:
			callReturn.ReturnValues = values
			return callReturn
		case <-time.After(r.timeout):
			// The goroutine is still blocked in the call; it is abandoned, not cancelled.
			callReturn.Error = ErrTimeout
		}
		retries++
		if retries >= r.maxRetries {
			break
		}
	}
	return callReturn
}
```

Example of invocation:

```go
NewRetryingUtils(time.Second*10, 1).CallWithRetries(fd.Close)
```

After two days of testing, this approach proved unworkable. The task goroutine carries on after the timeout, but the goroutine that made the call stays stuck, and there is a chance that the process ends up defunct. (Note: countless articles say that a defunct zombie process is caused by a parent that does not handle its child's exit status. That covers only some zombie processes. For example, if a goroutine is stuck in a system call and the process is killed, the process also becomes defunct.)

## Second attempt: copying files with a system command

We copied the files with the Linux cp command instead. After two days of testing, it passed. (Why this succeeded was not fully understood. Each round of load testing took up a large amount of two testers' time, so testing over and over was not a good option either. In the development or test environment it would have been easy, because we developers can test there ourselves.)

## Third attempt: testing whether requesting extra read permission is the cause

We started by reading the source code of the Linux cp command to look for the reason.
In coreutils' src/copy.c we found the [copy_reg](https://github.com/coreutils/coreutils/blob/master/src/copy.c) function, whose job is to copy regular files. Compared with that function, Go's os.Create requests read permission in addition to write permission by default.

Creating the file in copy_reg:

```c
int open_flags =
    (O_WRONLY | O_BINARY | (x->data_copy_required ? O_TRUNC : 0));
dest_desc = open (dst_name, open_flags);
```

Creating the file in Go:

```go
func Create(name string) (*File, error) {
	return OpenFile(name, O_RDWR|O_CREATE|O_TRUNC, 0666)
}
```

Because I used to work on remote process injection, I knew that requesting unnecessary permissions raises the failure rate, so I suspected this might be the cause. Unfortunately, the test showed that it was not.

## Fourth attempt: calling Sync explicitly before closing the target file handle

The code is as follows:

```go
func CopyFile(src, dest string) (copiedSize int64, err error) {
	copiedSize = 0
	srcFile, err := os.Open(src)
	if err != nil {
		return copiedSize, err
	}
	defer srcFile.Close()

	destFile, err := os.Create(dest)
	if err != nil {
		return copiedSize, err
	}
	defer func() {
		destFile.Sync() // hangs here
		destFile.Close()
	}()

	return io.Copy(destFile, srcFile)
}
```

Example of a stuck goroutine:

```
goroutine 51634 [syscall, 523 minutes]:
syscall.Syscall(0x4a, 0xd, 0x0, 0x0, 0xafb1a0, 0xc42000c160, 0x0)
	/usr/local/go/src/syscall/asm_linux_amd64.s:18 +0x5
syscall.Fsync(0xd, 0x0, 0x0)
	/usr/local/go/src/syscall/zsyscall_linux_amd64.go:492 +0x4a
os.(*File).Sync(0xc420168a00, 0xc4201689f8, 0xc420240000)
	/usr/local/go/src/os/file_posix.go:121 +0x3e
common/utils.CopyFile.func3()
```

We were getting closer and closer to the truth, which was exciting. The presumed reason is that write only puts the data into a cache and does not actually write it to disk (here, GlusterFS). The final sync() then hangs because of the network or some other reason.
## Truth

We found this mailing-list post: https://lists.gnu.org/archive/html/gluster-devel/2011-09/msg00005.html

We then checked the GlusterFS configuration of each environment. It turned out that in the production and pre-production environments performance.flush-behind was off, while in the development and test environments it was on. Even the wise slip up sometimes: operations work is rigorous, and an occasional oversight is normal. Of course, the happiest thing is that the problem is finally solved.

```
gluster> volume info

Volume Name: pre-volume
Type: Striped-Replicate
Volume ID: 3b018268-6b4b-4659-a5b0-38e1f949f10f
Status: Started
Number of Bricks: 1 x 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.10.20.201:/data/pre
Brick2: 10.10.20.202:/data/pre
Brick3: 10.10.20.203:/data/pre
Brick4: 10.10.20.204:/data/pre
Options Reconfigured:
performance.flush-behind: off    // with "on" here, everything is OK
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
performance.readdir-ahead: on
```

Related issue: https://github.com/gluster/glusterfs/issues/341
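Since the environments differed in a single volume option, a comparison of that option across environments could be automated. The following is a minimal sketch, assuming the text printed by `volume info` has already been captured as a string; `volumeOption` is a hypothetical helper, and the sample text is abbreviated from the output shown in this article.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// volumeOption (hypothetical helper) scans "key: value" lines, as printed by
// `gluster volume info`, and returns the value for key and whether it was found.
func volumeOption(info, key string) (string, bool) {
	sc := bufio.NewScanner(strings.NewReader(info))
	for sc.Scan() {
		// Split on the first colon only, so values such as
		// "10.10.20.201:/data/pre" stay intact.
		parts := strings.SplitN(sc.Text(), ":", 2)
		if len(parts) == 2 && strings.TrimSpace(parts[0]) == key {
			return strings.TrimSpace(parts[1]), true
		}
	}
	return "", false
}

func main() {
	// Abbreviated sample input; in practice this would be the captured CLI output.
	sample := `Volume Name: pre-volume
Status: Started
Options Reconfigured:
performance.flush-behind: off
performance.readdir-ahead: on`

	if v, ok := volumeOption(sample, "performance.flush-behind"); ok {
		fmt.Println("performance.flush-behind =", v)
	} else {
		fmt.Println("performance.flush-behind is not set explicitly")
	}
}
```

If the option turns out to be off, it should be possible to change it with the standard volume-set command, `gluster volume set <VOLNAME> performance.flush-behind on`; whether that is appropriate for a given deployment is for its operators to decide.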

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
