Bash Script for Batch Job Parallelization

When running jobs on Linux, you often encounter the following situation: a large number of jobs need to be run, and each individual job does not take very long. If these jobs are run serially, the total time can be very long; if they are run in parallel, the running time can be greatly reduced. Furthermore, most current computers have multi-core architectures, and parallel execution is needed to make full use of their computing power. Summarizing the information I have found on the Internet, the following methods can be used to parallelize batch jobs with Bash scripts. Note: processes and threads are not distinguished here, nor are parallel and concurrent tasks.

1. Use the GNU parallel Program

parallel is a GNU program written specifically for parallelization, and it is well suited to simple batch-job parallelization. You do not need to write a script to use it; simply prepend parallel to the original command. Therefore, if parallel can parallelize your job, use it first. For more information about parallel, see its official documentation.
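
For example (the file names, job count, and -j value here are only placeholders for illustration):

    # compress every .log file in the current directory, one job per CPU core by default
    parallel gzip ::: *.log

    # run 10 dummy jobs with at most 4 running at a time; {} is replaced by each input value
    seq 10 | parallel -j 4 'echo "job {}"; sleep 1'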

2. The Simplest Parallel Method: & + wait

Bash's background execution operator (&) together with the wait builtin gives the simplest form of batch-job parallelization.

The following code takes about 10 seconds for serial execution:
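
A minimal sketch, assuming the jobs are simulated by sleep calls whose durations add up to roughly 10 seconds (the durations are illustrative, not from the original listing):

    #!/bin/bash
    # six dummy jobs run one after another; total time ~ 3+1+2+1+2+1 = 10 seconds
    for t in 3 1 2 1 2 1; do
        sleep "$t"
    done
    echo "all jobs finished"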

Changing it to the following simple parallel code ideally compresses the running time to about 3 seconds, the length of the longest job.
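
The same dummy jobs, now started in the background:

    #!/bin/bash
    # the same six dummy jobs, all started in the background at once
    for t in 3 1 2 1 2 1; do
        sleep "$t" &
    done
    wait                     # block until every background job has finished
    echo "all jobs finished"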

3. Parallel Method with Controllable Process Count (1): Simulated Queue

Running multiple processes simultaneously from a Bash script is not difficult; the main problem is controlling how many processes run at the same time. The simple parallelization method above offers no such control, which limits its usefulness: the number of jobs to run is usually much larger than the number of available processors, and launching all of them in the background at once slows everything down and greatly reduces parallel efficiency. A simple solution is to simulate a queue with a maximum length equal to the allowed number of processes. The PIDs of running jobs serve as the queue elements; the queue is checked periodically, and whenever a job in the queue has finished, a new job is added. Because different jobs take different amounts of time, this approach also avoids unnecessary waiting. The following is my rewrite of an implementation found on the Internet; see the original article for more practical code.
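
A minimal sketch of this approach, assuming dummy sleep jobs; the variable names njob and nproc and the polling interval are placeholders:

    #!/bin/bash
    # parallelize batch jobs by simulating a queue of running PIDs

    njob=10                      # total number of jobs (placeholder)
    nproc=3                      # maximum number of concurrent jobs
    queue=()                     # the queue: PIDs of jobs that are currently running

    # rebuild the queue, dropping PIDs whose processes have already exited
    check_queue() {
        local alive=()
        for pid in "${queue[@]}"; do
            if kill -0 "$pid" 2>/dev/null; then
                alive+=("$pid")
            fi
        done
        queue=("${alive[@]}")
    }

    for ((i = 0; i < njob; i++)); do
        # wait until the queue has a free slot
        while ((${#queue[@]} >= nproc)); do
            sleep 0.1
            check_queue
        done
        sleep $((RANDOM % 3 + 1)) &   # placeholder for the real job
        queue+=("$!")                 # push the new job's PID onto the queue
    done

    wait                              # wait for the jobs still in the queue
    echo "all jobs finished"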

A more concise variant records the PIDs in an array and decides whether a job is still running by checking whether its PID still refers to a live process. It can be implemented as follows:
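
A sketch under the same assumptions: a fixed-size array holds one PID per slot, and a slot is free when it is empty or its process has exited.

    #!/bin/bash
    # parallelize batch jobs with a fixed-size array of PIDs

    njob=10                      # total number of jobs (placeholder)
    nproc=3                      # maximum number of concurrent jobs
    pids=()                      # one slot per allowed process

    i=0
    while ((i < njob)); do
        started=0
        for ((slot = 0; slot < nproc; slot++)); do
            # a slot is free if it has no PID yet or its process has exited
            if [[ -z "${pids[slot]}" ]] || ! kill -0 "${pids[slot]}" 2>/dev/null; then
                sleep $((RANDOM % 3 + 1)) &   # placeholder for the real job
                pids[slot]=$!
                ((i++))
                started=1
                break
            fi
        done
        ((started)) || sleep 0.1     # all slots busy: wait briefly before retrying
    done

    wait
    echo "all jobs finished"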

4. Parallel Method with Controllable Process Count (2): Named Pipe

The preceding parallel method can also be implemented with a named pipe. A named pipe, also known as a FIFO (first in, first out) file, is one of Linux's mechanisms for inter-process communication. The idea is to create a FIFO file as a process pool that stores a fixed number of "tokens". The rules are: every job must obtain a token in turn; a job takes a token from the pool before it runs and returns the token when it finishes; when the pool has no tokens left, jobs waiting to run must wait. This guarantees that the number of jobs running at any moment equals the number of tokens. In the simulated-queue method above, the PIDs play the role of tokens.

Judging from what I have read, this is the method most often discussed on the Internet. The implementation is also concise, but it takes a fair amount of Linux knowledge to understand the code. The following is my rewritten sample code with comments.
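
A sketch assuming dummy sleep jobs; the descriptor number 6 and the variable $Pfifo follow the notes below, while njob and nproc are placeholder names:

    #!/bin/bash
    # parallelize batch jobs with a named pipe (FIFO) used as a pool of tokens

    njob=10                       # total number of jobs (placeholder)
    nproc=3                       # number of tokens = maximum concurrent jobs

    Pfifo="/tmp/$$.fifo"          # path of the named pipe
    mkfifo $Pfifo                 # create the FIFO file
    exec 6<>$Pfifo                # open file descriptor 6 read-write on the FIFO (see note (1))
    rm -f $Pfifo                  # the open descriptor keeps the pipe usable; the file itself can go

    # fill the pool: one newline per allowed concurrent process
    for ((i = 0; i < nproc; i++)); do
        echo
    done >&6

    for ((i = 0; i < njob; i++)); do
        read -u6                  # take a token; blocks while the pool is empty (see note (2))
        {
            sleep $((RANDOM % 3 + 1))   # placeholder for the real job
            echo "job $i finished"
            echo >&6              # return the token when the job is done
        } &
    done

    wait                          # wait for the remaining background jobs
    exec 6>&-                     # close file descriptor 6
    echo "all jobs finished"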

Note:

(1) exec 6<>$Pfifo is very important. Without this statement, writing to $Pfifo would block until some read had consumed the data in the file. After this statement is executed, the program can write to the file at any time without blocking, and the written data is kept for later read operations.

(2) When $Pfifo contains no data, read gets nothing and the process blocks on the read operation until some child process finishes and writes a line back to $Pfifo.

(3) The core execution part can also be implemented as follows:
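
For example, a sketch of the loop only, using ( ) instead of { } around the background job body (the placeholder job is the same as in the sample above):

    for ((i = 0; i < njob; i++)); do
        read -u6
        (
            sleep $((RANDOM % 3 + 1))   # placeholder for the real job
            echo "job $i finished"
            echo >&6                    # return the token
        ) &
    done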

The difference between { } and ( ) is whether the shell spawns a subshell for the grouped commands: ( ) always runs them in a subshell, while { } groups them in the current shell.

(4) This method cannot be used in the current version of Cygwin (1.7.27) because Cygwin does not support bidirectional named pipes. Someone suggested a workaround that replaces the single file descriptor with two separate ones, but I did not manage to test it successfully.


