Summary of data migration tasks between TFS clusters


In the past few days, we have been working on a data migration task between clusters. We are given a task file in which each line describes one migration in the form source:dest (both source and dest are file names), and there are tens of millions of such lines. Each task is simple: read a line, parse out source and dest, read the source file from the source cluster using the given cluster information, and write it to dest on the target cluster.

After cycling through writing the program, executing tasks, analyzing logs, modifying the program, and re-executing the failed subtasks several painful times, I found that I had taken many detours, because I did not appreciate the complexity of the problem at the beginning and my initial solution was too simplistic. Here I will share some of my experience in processing such large batches of tasks.

Batch task processing mainly involves two aspects: performance and correctness.

Performance is measured by the total task execution time, and is improved mainly by optimizing the execution time of each task and by parallel processing. For this migration task, each migration reads a file from the source cluster and writes it to the target cluster. Because reads and writes go through client interfaces, the only per-task optimization is to make proper use of the client cache during reads; the file namespace is flat and there are no links between files, so the task order cannot be reorganized to gain anything. Because there are no dependencies or connections between subtasks, the migration is very well suited to parallel processing with multiple processes or threads. About 30 tasks can be processed per second (roughly the time of one random read plus one random write), so the sequential processing time with a single process (single thread) is about 3.9 days; with 10 processes (threads) running simultaneously, it drops to about 10 hours. To reduce the coding workload, I used a simple single-threaded read/write program: on the outside, I split the task file into 10 sub-files and started 10 processes, one per sub-file. The reasons for choosing 10 were: (1) the arithmetic is easy; (2) 10 hours is acceptable, since the run finishes after a night's sleep; (3) ten processes running in parallel are enough to saturate the NIC.
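The split-and-parallelize approach above can be sketched in Python. This is a minimal illustration, not the author's actual tool: the file name tasks.txt, the round-robin split, and the migrate_one placeholder (which would wrap the real cluster client calls) are all assumptions.

```python
import multiprocessing

def split_tasks(task_file, n_parts):
    """Split a task file into n_parts sub-files, one per worker process."""
    with open(task_file) as f:
        lines = f.readlines()
    part_names = []
    for i in range(n_parts):
        name = "%s.part%02d" % (task_file, i)
        with open(name, "w") as out:
            # Round-robin split keeps the parts roughly equal in size.
            out.writelines(lines[i::n_parts])
        part_names.append(name)
    return part_names

def migrate_one(source, dest):
    """Placeholder for the real client calls: read `source` from the
    source cluster and write it to `dest` on the target cluster."""
    pass

def worker(part_file):
    """Sequentially process one sub-file, one source:dest task per line."""
    with open(part_file) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            source, dest = line.split(":", 1)
            migrate_one(source, dest)

if __name__ == "__main__":
    # Write a tiny sample task file so the sketch is self-contained.
    with open("tasks.txt", "w") as f:
        f.write("srcA:dstA\nsrcB:dstB\nsrcC:dstC\n")
    parts = split_tasks("tasks.txt", 3)
    procs = [multiprocessing.Process(target=worker, args=(p,)) for p in parts]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
```

In practice the author launched the 10 processes from the shell rather than from Python; the point is only that independent subtasks need no coordination beyond splitting the input.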

Next, let's talk about correctness. This is actually the hardest part. Each migration task can fail for many reasons: the task description may be invalid, reading the source may fail (this can be further divided into several stages), or writing the target may fail (this can also be subdivided). Some errors are unavoidable, for example the given source or target does not follow the naming rules, or the source file does not exist. Some errors are transient, for example only part of a file is read, and can be resolved by re-executing the task. Others are caused by bugs in the programs (the tool itself, or even the client library) and can be eliminated by fixing the program.

Logging is how these different situations are distinguished. Over the past few days I have found that printing good logs is genuinely difficult; it is not just a matter of choosing where to emit output. (I will not discuss log levels here, only techniques for what to put in the log content.)

First, logs that describe errors must make it possible to quickly locate the position and cause of the error. Second, the log content must be easy to process with tools such as grep and awk, to avoid having to write dedicated log-analysis tools. In addition, it is best to describe the failed task in full in the log, so that during a second pass you do not need to dig the failed tasks out of the original task file. When I first wrote the migration program, I printed only the source information when the source could not be read; when it came to reprocessing, I had to write another Python script to match these entries back against the task file. If the full source:dest pair had been logged, a single awk command over the log would have been enough to extract the failed tasks and re-process them.
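To make the point concrete, here is a small Python sketch of the "second pass" when the full task is in the log. The one-line-per-task log format (RESULT FAIL stage=... task=source:dest) is hypothetical, invented for illustration; the author does not specify an exact format.

```python
def failed_tasks(result_log_path):
    """Collect the source:dest pairs of failed tasks from a result log.

    Assumes a hypothetical one-line-per-task format such as:
        RESULT FAIL stage=READ_SOURCE err=22 task=src_name:dst_name
    so the failed tasks can be fed straight back into the migration
    tool, without consulting the original task file at all.
    """
    tasks = []
    with open(result_log_path) as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 2 and fields[0] == "RESULT" and fields[1] == "FAIL":
                for field in fields:
                    if field.startswith("task="):
                        tasks.append(field[len("task="):])
    return tasks
```

The equivalent awk would be a one-liner over the same log; either way, no cross-referencing script is needed once the log carries the whole task description.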

For the data migration task described in this article, I have summarized a set of effective log printing practices:

1. Differentiate error logs from result logs. The error log records detailed information at the moment an error occurs; the result log records the final outcome of each task (success or failure). Processing one task may produce many error-log records (including error logs printed by the libraries it uses), but each task produces exactly one result-log line. The result log should preferably contain all of the information that describes the task;

2. Divide each task into multiple stages, and keep the current stage updated as the task is processed. If a task fails, print its stage and error information (error code) in the result log: the stage quickly locates where the error occurred, and the error description (error code) classifies the error;

3. Output error logs and result logs to different locations, for example error logs to stderr and result logs to stdout, then redirect stderr and stdout to different files.
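The three practices above can be sketched together in a few lines of Python. The stage names and log line format here are assumptions for illustration; what matters is the shape: stage tracked throughout, one result line per task, errors to stderr and results to stdout.

```python
import sys
import logging

# Two separate loggers: error details go to stderr, one result line per
# task goes to stdout, so the two streams can be redirected to different
# files, e.g.:  migrate_tool > result.log 2> error.log
err_log = logging.getLogger("error")
err_log.addHandler(logging.StreamHandler(sys.stderr))
err_log.setLevel(logging.INFO)

res_log = logging.getLogger("result")
res_log.addHandler(logging.StreamHandler(sys.stdout))
res_log.setLevel(logging.INFO)

def migrate(source, dest, read_fn, write_fn):
    """Run one migration task, tracking the stage it is in.

    `read_fn`/`write_fn` stand in for the real cluster client calls
    (hypothetical parameters, used here to keep the sketch testable).
    Returns True on success, False on failure.
    """
    stage = "PARSE"
    try:
        stage = "READ_SOURCE"
        data = read_fn(source)
        stage = "WRITE_DEST"
        write_fn(dest, data)
        stage = "DONE"
        res_log.info("RESULT OK task=%s:%s", source, dest)
        return True
    except Exception as exc:
        # Error log: full detail for diagnosing the cause.
        err_log.info("error in stage %s for %s:%s: %r", stage, source, dest, exc)
        # Result log: one line per task, carrying stage + error class +
        # the whole task description, ready for grep/awk.
        res_log.info("RESULT FAIL stage=%s err=%s task=%s:%s",
                     stage, type(exc).__name__, source, dest)
        return False
```

Because every result line carries the stage, a grep for a stage name immediately groups all tasks that failed at the same point.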

The result log allows you to quickly classify executed tasks by error information (with grep) and handle each class of errors differently. When you need to know the specific cause of an error, turn to the error log.

Finally, if you have high requirements on the correctness of the migrated data, you can compute a CRC or MD5 checksum during the migration, or write a separate verification tool afterward to perform a full check; that job is very similar in nature to the migration itself.
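A minimal sketch of the MD5 variant of this check, using only the standard library. The chunked interface is an assumption, chosen because cluster clients typically return file data in pieces rather than all at once.

```python
import hashlib

def md5_of(chunks):
    """MD5 over an iterable of byte chunks, as a client read loop
    over a large file would typically yield them."""
    h = hashlib.md5()
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()

def verify(source_chunks, dest_chunks):
    """Compare the checksum of the source file against the copy
    just written to the target cluster."""
    return md5_of(source_chunks) == md5_of(dest_chunks)
```

Computing the digest during the migration's own read/write loop avoids a second full read of the source; a standalone verification tool would instead re-read both copies, which is why its workload resembles the migration itself.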
