How to improve performance in data processing: introduce concurrency but avoid synchronization

Background

Wherever there is a database, there is a need to process data in batches in the background: table backups, scheduled cleanup, data replacement, data migration, and so on. Batch processing often involves large numbers of queries, filters, classifications, and aggregate computations. Batch scripts that query the database directly tend to perform poorly; one large SQL statement even caused an online incident by locking tables. The usual approach, therefore, is to first export the data to a file, run the computation on the file, and then import the result back. For example:

1. run mysql -e "select * from table" > output.txt to export the query result to a file;

2. perform aggregation, filtering, replacement, and other computations on the file, finally producing the desired format;

3. publish the output file, or use the LOAD DATA command to import it into the database.

Because the database is queried only once, in batch, to export the data to a file, and all subsequent computation runs on the file rather than hitting the database repeatedly, a lot of network I/O is saved and processing speed improves.
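
For step 2, a minimal single-threaded sketch in Java might look like the following. It assumes a tab-separated export whose first column is a grouping key and whose second column is a numeric amount; the file layout, column meanings, and class name are illustrative, not part of the original tutorial.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class AggregateExport {
    public static void main(String[] args) throws IOException {
        // Sum the second column, grouped by the first column.
        // The column layout is an assumption for illustration only.
        Map<String, Long> sums = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new FileReader("output.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] cols = line.split("\t");
                sums.merge(cols[0], Long.parseLong(cols[1]), Long::sum);
            }
        }
        // Print in a tab-separated format that LOAD DATA can re-import.
        for (Map.Entry<String, Long> e : sums.entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}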

However, once the exported file is obtained, if the file is very large or the computing logic is complex (for example, a large amount of CPU-hungry regular-expression matching and aggregate computation), single-threaded processing takes a long time. Introducing concurrency at this point lets the machine's CPU, memory, I/O, network, and other resources be fully used, greatly reducing processing time.

Introduce multithreading: split the input into multiple small files, and start one processing thread per small file

Hadoop's MapReduce approach is to split the file into small part files first, compute each part, and finally aggregate the per-part results. However, because of Hadoop scheduling overhead, cluster stability, and other factors, files in the MB range may process very slowly, sometimes even more slowly than a single thread. Changing a single-machine, single-threaded program to multithreading often yields a surprising improvement.

The intuitive approach is to have the main thread read the single large input file, hand the lines it reads to worker threads for processing, and then merge the results back in the main thread. Because multiple threads then share the I/O of one file, a synchronization mechanism must be added around the file, and in the end the performance bottleneck turns out to be exactly this single-file reading and synchronization.
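
A minimal sketch of this shared-reader pattern (all names here are hypothetical) shows where the contention comes from: every worker must acquire the same lock for every line it reads.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class SharedReaderDemo {
    public static void main(String[] args) throws Exception {
        // One reader shared by all workers: every readLine() call
        // serializes on the same lock, which becomes the bottleneck.
        BufferedReader shared = new BufferedReader(new FileReader(args[0]));
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            Thread t = new Thread(() -> {
                try {
                    String line;
                    while (true) {
                        synchronized (shared) {          // the contention point
                            line = shared.readLine();
                        }
                        if (line == null) break;
                        process(line);                   // the actual work
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });
            workers.add(t);
            t.start();
        }
        for (Thread t : workers) t.join();
        shared.close();
    }

    private static void process(String line) {
        // placeholder for the real per-line computation
    }
}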

Instead, you can split the large file into small files and assign each file to its own thread, avoiding resource synchronization between threads entirely. Each thread then uses a separate CPU core, its own memory, and its own file handle, achieving the fastest processing speed.

To use this method, follow these steps:

1. use a shell script to split the input file into a predetermined number of parts (one per thread) and store them in a directory;

2. pass the directory path as a parameter to a Java/Python program, which reads all the files in the directory and starts one processing thread per file;

3. in the shell, merge all files in the output directory with cat file* > output_file to obtain the final result.

Shell

Split the input file into multiple small files, start multithreaded processing, and output the result file:

function run_multi_task() {
    # Number of asynchronous threads to start (= number of split files)
    SPLITS_COUNT=20

    # Total number of lines in the input file
    source_file_lines_count=`cat ${input_file} | wc -l`

    # Lines per split file = total lines / number of splits
    split_file_lines_count=$(($source_file_lines_count/$SPLITS_COUNT))

    # Split by lines (-l), not by size, so that no row is truncated
    split -l $split_file_lines_count -a 3 -d ${input_file} ${input_dir}/inputFile_

    # Run the Java program; it starts one thread per file in ${input_dir}
    $JAVA_CMD -classpath $jar_path "net.crazyant.BackTaskMain" "${input_dir}" "${output_dir}" "${output_err_dir}"

    # Merge the per-thread output files
    cat ${output_dir}/* > ${output_file}
}

run_multi_task

Note that when splitting the file, you cannot have split divide it by size, because rows in the input file would be truncated; split by line count with -l instead.

The corresponding Java program reads the file list in the directory and starts a separate thread for each file:

Java

import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class BackTaskMain {
    public static void main(String[] args) {
        String inputDataDir = args[0];
        String outputDataDir = args[1];
        String errDataDir = args[2];

        File inputDir = new File(inputDataDir);
        File[] inputFiles = inputDir.listFiles();

        // Record the started threads
        List<Thread> threads = new ArrayList<Thread>();

        for (File inputFile : inputFiles) {
            if (inputFile.getName().equals(".") || inputFile.getName().equals("..")) {
                continue;
            }

            // For each input file, generate the corresponding output file and error file
            String outputFpath = outputDataDir + "/" + inputFile.getName() + ".out";
            String errOutputFpath = errDataDir + "/" + inputFile.getName() + ".err";

            // Create the Runnable: a new instance for every thread
            BackRzInterface backRzInterface = new BackRzInterface();
            backRzInterface.setInputFilePath(inputFile.getAbsolutePath());
            backRzInterface.setOutputFilePath(outputFpath);
            backRzInterface.setErrOutputFilePath(errOutputFpath);

            // Create and start the thread
            Thread singleRunThread = new Thread(backRzInterface);
            threads.add(singleRunThread);
            singleRunThread.start();
        }

        for (Thread thread : threads) {
            try {
                // thread.join() waits for the thread to finish
                thread.join();
                System.out.println(thread.getName() + " has finished");
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }

        System.out.println("process all over");
    }
}

In this way, a large file is split into small files, multiple threads are started, each thread processes one small file, and the per-file results are merged into the final output. Performance is greatly improved.

If there are dependent resources, copy, split, or clone them per thread, so that dependent resources do not become performance bottlenecks

In the code above, BackRzInterface is the Runnable used when each thread starts. Note that it is created with new for every thread:

// Create the Runnable
BackRzInterface backRzInterface = new BackRzInterface();

In this way, the BackRzInterface that each processing thread depends on is independent, and using this Runnable causes no synchronization problems.
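
The tutorial does not show BackRzInterface itself, but a Runnable along the following lines would fit the calls made above; everything beyond the three setters used in BackTaskMain is an assumption.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;

// One instance per thread: all state below is instance state,
// so no locking is needed between threads.
public class BackRzInterface implements Runnable {
    private String inputFilePath;
    private String outputFilePath;
    private String errOutputFilePath;

    public void setInputFilePath(String p) { this.inputFilePath = p; }
    public void setOutputFilePath(String p) { this.outputFilePath = p; }
    public void setErrOutputFilePath(String p) { this.errOutputFilePath = p; }

    @Override
    public void run() {
        try (BufferedReader reader = new BufferedReader(new FileReader(inputFilePath));
             PrintWriter out = new PrintWriter(outputFilePath);
             PrintWriter err = new PrintWriter(errOutputFilePath)) {
            String line;
            while ((line = reader.readLine()) != null) {
                try {
                    out.println(transform(line)); // the per-line computation
                } catch (RuntimeException e) {
                    err.println(line);            // bad rows go to the error file
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private String transform(String line) {
        // placeholder: replace/aggregate/match as the task requires
        return line;
    }
}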

If multi-threaded processing needs external resources, it is best to give each thread its own exclusive copy to use independently; with no conflicts between threads, you get maximum concurrency.

Other examples:

  • When multiple threads need a dictionary file, first make one copy of the dictionary per thread; each thread then reads its own copy, avoiding concurrent, synchronized access to a single dictionary;
  • If multiple threads need IDs issued from a single sequence, count the lines of each input file in advance and pre-allocate consecutive ID ranges: the first thread gets the first range, the second thread the next, and so on (the ranges can even be written out as per-thread files). Each thread then consumes its own independent ID range, instead of all threads hitting a single ID-issuing service at run time and turning it into a performance bottleneck (see the first sketch after this list);
  • If multiple threads depend on the same service, create a new object each time; for a Spring-managed bean, add @Scope("prototype") to the service, or clone the object to get a fresh instance, so that each thread uses an object it owns exclusively (see the second sketch after this list);
  • Use a functional style wherever possible: give each function no side effects, do not modify input parameters, and return results only through return values, so that no code-level synchronization is needed.
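
As a minimal sketch of the ID pre-allocation idea (all names here are hypothetical; the per-file line counts would come from wc -l on the split files):

// Pre-allocate disjoint ID ranges, one per thread, so no thread
// ever has to contact a shared ID-issuing service.
public class IdRangeAllocator {
    public static void main(String[] args) {
        long[] linesPerFile = {1000, 1000, 950};   // e.g. from `wc -l` on each split
        long nextId = 1;
        for (int i = 0; i < linesPerFile.length; i++) {
            long start = nextId;
            long end = start + linesPerFile[i];    // exclusive upper bound
            nextId = end;
            // Thread i may now generate IDs in [start, end) without any locking.
            System.out.println("thread " + i + ": ids [" + start + ", " + end + ")");
        }
    }
}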
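
And a minimal sketch of the prototype-scope approach; the service itself is made up, but @Scope("prototype") is standard Spring and makes the container hand out a fresh instance on every lookup:

import org.springframework.context.annotation.Scope;
import org.springframework.stereotype.Service;

// With prototype scope, each getBean() lookup (e.g. one per worker thread)
// yields a fresh instance, so instance state is never shared across threads.
@Service
@Scope("prototype")
public class ReplaceRuleService {
    private final StringBuilder buf = new StringBuilder(); // per-instance state

    public String apply(String line) {
        buf.setLength(0);
        buf.append(line.trim());
        return buf.toString();
    }
}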

With these methods, shared resources that would otherwise need synchronized access are replicated or sharded in advance, each thread accesses only its own copy, synchronization is avoided, and optimal performance is achieved.

The ultimate way to avoid synchronization: use multiple processes for resource isolation

What if you have split the input file into multiple parts and provided per-thread copies of the dependent IDs, dictionaries, and other resources, but still find code-level synchronization that cannot be removed?

Compared with the effort of solving the synchronization problem in code, the raw performance difference between multithreading and multiprocessing is small. Threads share the resources of their process, which is what creates contention between them; processes, by contrast, are isolated from one another in CPU, memory, and other resources. Running the program as multiple processes instead of multiple threads can therefore improve performance.

For languages with poor multithreading support, such as PHP, this multi-process approach is no slower than multithreaded Java or Python:

Shell

# Number of files to split into, i.e. the number of processes to start
SPLITS_COUNT=20

input_splits_dir="${input_dir}_splits"
output_splits_dir="${output_dir}_splits"

# Total number of lines in the input file
source_file_lines_count=`cat ${input_file} | wc -l`

# Lines per split file = total lines / number of splits (= number of processes)
split_file_lines_count=$(($source_file_lines_count/${SPLITS_COUNT}))

# Execute the split; note that -l (split by lines) is used, never split by size
split -l $split_file_lines_count -a 3 -d ${input_file} ${input_splits_dir}/inputfile_

process_idx=1
for fname in $(ls ${input_splits_dir}); do
    input_fpath=${input_splits_dir}/$fname
    output_fpath=${output_splits_dir}/$fname

    # & runs each process in the background, so all of them start at once
    php "/php/main.php" "${input_fpath}" "${output_fpath}" &

    ((process_idx++))
done

# Wait until all background processes have finished
wait

# Merge the per-process output files
cat ${output_splits_dir}/* > ${output_file}

In the code above, the shell's & symbol starts multiple processes in the background at the same time, and the wait builtin provides the equivalent of Thread.join for threads: it blocks until all background processes have finished.

Summary

For data processing whose input size and complexity sit between single-machine and cluster computing, single-machine concurrency is often the best fit, but synchronization can eat into multithreaded performance. In that case, copy and split the input files, and copy and split the dependent resources, so that each thread processes only resources it owns exclusively, maximizing computing speed. Logic with unavoidable code-level synchronization conflicts can be degraded to multi-process data processing: with the shell's background processes, process-level resource exclusivity can be achieved, greatly improving processing performance.
