I have a program that captures images and packages them into a zip file.
Currently it uses a single thread, which is too slow, so I want to capture concurrently with multiple threads.
However, I don't know when to package the zip file.
Could you give me some advice?
Replies (solution)
Write to disk only after a capture completes.
Maintain a target list in the program; when every object in the list exists on disk, you can package them.
Use multi-thread synchronization, for example semaphore-based synchronization.
How do you capture the images? Can you share your approach?
Use multi-thread synchronization, for example semaphore-based synchronization.
Can you give me a clearer explanation?
Write to disk only after a capture completes.
Maintain a target list in the program; when every object in the list exists on disk, you can package them.
I have thought about this method, but it seems quite troublesome, because there are hundreds of topics in rotation on my side, and one topic contains seven or eight images.
Each time, the image addresses are extracted from the content with regular expressions, the images are crawled to the server, and then they are compressed.
If I have to record progress, it would be a loop of concurrent batches: only after one batch finishes does the next loop start.
How do you capture the images? Can you share your approach?
Capturing images is not difficult.
The simplest way is file_get_contents.
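As a minimal sketch of the file_get_contents approach (the function name and paths below are my own, not from the thread):

```php
<?php
// Download one image and save it to disk.
// file_get_contents accepts http(s) URLs as well as local paths,
// so the same helper works in both cases.
function fetch_image(string $url, string $dest): bool
{
    $data = @file_get_contents($url); // @ suppresses the warning on failure
    if ($data === false) {
        return false; // download failed
    }
    return file_put_contents($dest, $data) !== false;
}
```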
Don't you record which topic each image belongs to?
Don't you record which topic each image belongs to?
There is no associated image record table.
Because the content may be updated frequently, adding one now doesn't make much sense.
Create a folder for each topic. Each time you run the regular-expression match, count how many images there are and write that number to a dedicated file in the folder, for example a file named "number".
Then write a separate packaging program that scans these folders and compares the number of image files with the count in the dedicated file; when they are equal, package the folder into a zip and delete it.
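A rough sketch of that packaging program, under the assumed layout that each topic folder contains its images plus a "number" file holding the expected count (the function name and layout details are illustrative, not from the thread):

```php
<?php
// Scan topic folders under $root; when a folder holds as many image
// files as its "number" file says, zip it and remove the folder.
function package_ready_topics(string $root): array
{
    $packaged = [];
    foreach (glob($root . '/*', GLOB_ONLYDIR) as $dir) {
        $numberFile = $dir . '/number';
        if (!is_file($numberFile)) {
            continue; // topic not initialised yet
        }
        $expected = (int) trim(file_get_contents($numberFile));
        // every regular file except the "number" file counts as an image
        $images = array_filter(glob($dir . '/*'),
            fn($f) => $f !== $numberFile && is_file($f));
        if (count($images) < $expected) {
            continue; // still capturing
        }
        $zipPath = $dir . '.zip';
        $zip = new ZipArchive();
        if ($zip->open($zipPath, ZipArchive::CREATE | ZipArchive::OVERWRITE) !== true) {
            continue;
        }
        foreach ($images as $img) {
            // store under "<topic id>/<file>" inside the archive
            $zip->addFile($img, basename($dir) . '/' . basename($img));
        }
        $zip->close();
        // delete the folder once it has been packaged
        foreach (glob($dir . '/*') as $f) { unlink($f); }
        rmdir($dir);
        $packaged[] = $zipPath;
    }
    return $packaged;
}
```

This keeps the capture workers and the packager completely decoupled: the workers only write files, and the packager only looks at the file system.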
This is a PV operation: you maintain a semaphore, which is simply a counter. If you know before the program runs how many files will be captured, the value is fixed at the start. For example, if you know you want to capture 5 images, the counter starts at 5; each time a sub-process finishes capturing an image, it decrements the counter by one, and when the value reaches 0, all images have been captured.
If you do not know the number of images in advance, the semaphore has to be maintained dynamically, and extra semaphores are needed to synchronize the processes. For example, the counter may be 0 while some process is still analyzing whether there are more images to capture; if the program exits as soon as the counter reaches 0, that case is missed. This situation must also be handled.
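Since stock PHP has no native threads, the counting idea above can be emulated between processes with a counter file guarded by flock. This is only a sketch under that assumption; the file path and function names are my own:

```php
<?php
// Emulate the PV counter with a lock-protected file: initialise it to
// the number of images, let each worker decrement it after a capture,
// and treat 0 as "all captures finished".
function counter_init(string $path, int $value): void
{
    file_put_contents($path, (string) $value, LOCK_EX);
}

function counter_decrement(string $path): int
{
    $fp = fopen($path, 'r+');
    flock($fp, LOCK_EX);           // exclusive lock: one worker at a time
    $value = (int) stream_get_contents($fp);
    $value--;
    ftruncate($fp, 0);
    rewind($fp);
    fwrite($fp, (string) $value);
    flock($fp, LOCK_UN);
    fclose($fp);
    return $value;                 // 0 means every capture is done
}
```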
By the way, how do you plan to use multiple threads? PHP itself does not support multithreading.
Apache has multi-threaded MPMs.
You can also use iframes to fire parallel requests.
By the way, how do you plan to use multiple threads? PHP itself does not support multithreading.
cURL in PHP 5+ supports parallel transfers via curl_multi. Do you need to package each download separately?
Create a folder for each topic. Each time you run the regular-expression match, count how many images there are and write that number to a dedicated file in the folder, for example a file named "number".
Then write a separate packaging program that scans these folders and compares the number of image files with the count in the dedicated file; when they are equal, package the folder into a zip and delete it.
This method is quite good. However, it may be better to use a database table, because you can record more: for example the captured image, the capture status, and the completion time.
Let me briefly describe the situation here.
A dedicated app on the client side has the following feature:
users can download offline packages to view topics.
The format agreed with the client is zip.
The zip file structure is:
a folder (named by topic id) containing that topic's images and one topic homepage;
multiple topics produce multiple folders named by their topic IDs.
My current job is to generate the folder data for these topics, package them into zip files, and provide them to the app for download.
The topic images are captured by matching the image addresses in the topic content.
The current problem: there are hundreds of topics, each containing eight or nine images,
so crawling is slow.
Alternatively, I would need a table that records the number of images matched by the regular expression in each loop; this count is decremented whether each capture succeeds or fails.
The program then polls this count and moves on to the next topic once it reaches zero.
By the way, how do you plan to use multiple threads? PHP itself does not support multithreading.
You can use fsockopen to trigger the URL of the capture script; the calling program then continues without waiting for the response.
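A sketch of such a fire-and-forget trigger, with placeholder host, port, and path. Note that, as reported later in this thread, some server setups may not process the request unless part of the response is read, so treat this as a starting point:

```php
<?php
// Fire-and-forget HTTP trigger: open a socket, send the request line
// and headers, then close without reading the response, so the caller
// does not wait for the capture script to finish.
function trigger_async(string $host, int $port, string $path): bool
{
    $fp = @fsockopen($host, $port, $errno, $errstr, 2); // 2 s connect timeout
    if ($fp === false) {
        return false; // connection failed
    }
    $request = "GET {$path} HTTP/1.1\r\n"
             . "Host: {$host}\r\n"
             . "Connection: Close\r\n\r\n";
    fwrite($fp, $request);
    fclose($fp); // do not wait for any reply
    return true;
}
```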
cURL is recommended for crawling; it is simple and efficient.
cURL can write the captured data directly into a file.
curl_multi can drive multiple cURL handles at the same time.
curl_getinfo can obtain information about each transfer (if needed).
When every member of the curl_multi set has finished, the capture is done.
This also fits your intention of not saving progress information.
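A minimal sketch of the "write directly into a file" point, assuming the cURL extension is available; the function name and paths are placeholders:

```php
<?php
// Let cURL stream the response body straight into a file instead of
// buffering it in memory (CURLOPT_FILE takes an open file handle).
function curl_download_to_file(string $url, string $dest): bool
{
    $fp = fopen($dest, 'w');
    if ($fp === false) {
        return false;
    }
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_FILE, $fp);   // write the body to $fp
    curl_setopt($ch, CURLOPT_TIMEOUT, 30); // overall transfer timeout
    $ok = curl_exec($ch);
    curl_close($ch);
    fclose($fp);
    return $ok !== false;
}
```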
cURL is recommended for crawling; it is simple and efficient.
cURL can write the captured data directly into a file.
curl_multi can drive multiple cURL handles at the same time.
curl_getinfo can obtain information about each transfer (if needed).
When every member of the curl_multi set has finished, the capture is done.
This also fits your intention of not saving progress information.
I could add a table to record the capture status, but how should I design it?
One package is associated with multiple topics, and multiple packages may be generating zip files concurrently.
I have thought of this table structure.
proc table:
id | image address | package id | execution status (1 = captured) | topic id
First, clear the proc records for the package id currently being generated.
Loop over the topics:
match the image addresses and insert a record into the table for each;
when a capture succeeds or fails, update the execution status.
Finally, there is a polling loop:
wait until, for this package, the number of records marked as executed for a topic equals the number of images matched by the regular expression, then enter the next topic's cycle.
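A rough sketch of the proc table and the readiness check, here using SQLite via PDO purely for illustration (the column names are my guesses from the description above):

```php
<?php
// proc table: one row per image to capture, keyed by package and topic.
// status: 0 = pending, 1 = captured (or otherwise finished).
function proc_setup(PDO $db): void
{
    $db->exec('CREATE TABLE proc (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        image_url TEXT NOT NULL,
        package_id INTEGER NOT NULL,
        status INTEGER NOT NULL DEFAULT 0,
        topic_id INTEGER NOT NULL
    )');
}

// Has every image of this topic (within this package) been processed?
function topic_done(PDO $db, int $packageId, int $topicId): bool
{
    $sql = 'SELECT COUNT(*) AS total,
                   SUM(status = 1) AS done
            FROM proc WHERE package_id = ? AND topic_id = ?';
    $st = $db->prepare($sql);
    $st->execute([$packageId, $topicId]);
    $row = $st->fetch(PDO::FETCH_ASSOC);
    return $row['total'] > 0 && (int) $row['done'] === (int) $row['total'];
}
```

The polling loop described above would simply call topic_done until it returns true, then move on to the next topic.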
I'm not sure whether I have described it clearly; there are other operations in this generation process, and above I only described the key ones.
Is the design of my proc table reasonable?
I'm waiting for your reply.
I spent the afternoon writing a test program following this idea.
Test it, and improve it wherever you find deficiencies.
If you hit a problem you cannot solve, ask again.
I spent an afternoon fixing a script that used fsockopen to trigger the image capture. If fgets is not called, the trigger does not fire; it works in local tests, but I don't know why.
Then I searched the internet and found this:
/**
 * cURL multi "multithreading"
 * @param array $array parallel URLs
 * @param int   $timeout timeout in seconds
 * @return array
 */
function curl_http($array, $timeout)
{
    $res = array();
    $mh = curl_multi_init(); // create the multi handle
    $startime = getmicrotime();
    foreach ($array as $k => $url) {
        $conn[$k] = curl_init($url);
        curl_setopt($conn[$k], CURLOPT_TIMEOUT, $timeout); // set the timeout
        curl_setopt($conn[$k], CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; MSIE 5.01; Windows NT 5.0)');
        curl_setopt($conn[$k], CURLOPT_MAXREDIRS, 7); // HTTP redirect depth
        curl_setopt($conn[$k], CURLOPT_HEADER, 0); // do not include headers in the body
        curl_setopt($conn[$k], CURLOPT_FOLLOWLOCATION, 1); // follow 302 redirects
        curl_setopt($conn[$k], CURLOPT_RETURNTRANSFER, 1);
        curl_multi_add_handle($mh, $conn[$k]);
    }
    do {
        $mrc = curl_multi_exec($mh, $active); // $active stays true while transfers run
    } while ($mrc == CURLM_CALL_MULTI_PERFORM);
    while ($active && $mrc == CURLM_OK) { // while transfers are still receiving data
        if (curl_multi_select($mh) != -1) {
            do {
                $mrc = curl_multi_exec($mh, $active);
            } while ($mrc == CURLM_CALL_MULTI_PERFORM);
        }
    }
    foreach ($array as $k => $url) {
        curl_error($conn[$k]);
        $res[$k] = curl_multi_getcontent($conn[$k]); // get the returned content
        $header[$k] = curl_getinfo($conn[$k]); // get the transfer information
        curl_multi_remove_handle($mh, $conn[$k]); // detach before closing
        curl_close($conn[$k]); // close the easy handle
    }
    curl_multi_close($mh);
    $endtime = getmicrotime();
    $diff_time = $endtime - $startime;
    return array('diff_time' => $diff_time, 'return' => $res, 'header' => $header);
}

// current time in seconds, as a float
function getmicrotime()
{
    list($usec, $sec) = explode(" ", microtime());
    return ((float) $usec + (float) $sec);
}
I tested it. But I don't know why file_get_contents is faster than this in local testing.
while ($active && $mrc == CURLM_OK) { // while transfers are still receiving data
    if (curl_multi_select($mh) != -1) {
        do {
            $mrc = curl_multi_exec($mh, $active);
        } while ($mrc == CURLM_CALL_MULTI_PERFORM);
    }
}
Also, I don't quite understand the role of this part; the requests have already been started above.
Is this part used to detect when everything has completed?
do { $mrc = curl_multi_exec($mh, $active); } while ($mrc == CURLM_CALL_MULTI_PERFORM);
This looks the same as the loop above.
You used the curl_multi_select function, but I don't know what it does...
The manual says:
curl_multi_select -- Get all the sockets associated with the cURL extension, which can then be "selected"
Google Translate says:
curl_multi_select -- get all the sockets related to the cURL extension, so that you can "select" on them
I looked at this method today; with it, there is no need to record the number of images in the table.
Thank you, I understand now. Closing the thread!