Shell multithreaded web page screenshots
Linux has two web page screenshot tools, CutyCapt and PhantomJS. After testing, CutyCapt is slower but relatively stable, while PhantomJS is faster but sometimes hangs in a zombie-like state. Weighing the pros and cons, I decided to use CutyCapt plus a shell script to take the screenshots:
webshot.sh
#!/bin/bash
#webshot
#by Caishzh 2013
WEBSHOTDIR="/data/webshot"
mkdir -p $WEBSHOTDIR
while read LINE
do
    DISPLAY=:0 cutycapt --url=http://$LINE --max-wait=90000 --out=$WEBSHOTDIR/$LINE.jpg >/dev/null 2>&1
done < domain.txt
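For reference, the script expects domain.txt to contain one bare hostname per line, with no http:// prefix, since the script prepends it. A quick way to create a test file (these example domains are hypothetical, not from the original post):

```shell
# Create a sample domain.txt: one bare hostname per line (hypothetical domains)
cat > domain.txt <<'EOF'
example.com
example.org
example.net
EOF
wc -l < domain.txt    # 3 lines
```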
The script is very simple and needs no further comment; domain.txt is the URL list. CutyCapt installation and usage are covered here.
Run the script and it takes screenshots normally, with very high image quality. But another problem emerged: with tens of thousands of sites to capture, the run takes far too long, an estimated half a month or so.
That is too long to be acceptable, so the script needed optimizing. After some research I decided to take the screenshots with multiple threads. Strictly speaking, the shell cannot implement multiple threads; it simply puts multiple processes into the background.
multiwebshot.sh
#!/bin/bash
#Multithreading Webshot
#by Caishzh 2013
WEBSHOTDIR="/data/webshot"
mkdir -p $WEBSHOTDIR
# Split domain.txt into multiple files (names beginning with x), 5000 lines per file
split -l 5000 domain.txt
for i in `ls x*`;do
{
    for j in `cat $i`;do
        DISPLAY=:0 cutycapt --url=http://$j --max-wait=90000 --out=$WEBSHOTDIR/$j.jpg >/dev/null 2>&1
    done
}&
done
wait
# Remove the temporary files created by split
rm x* -f
Script Description:
split divides domain.txt into multiple files of 5000 lines each, and two nested for loops implement the multi-process screenshots. The first loop iterates over the file names produced by split; the second takes screenshots of the sites listed in each file. Note the & after the closing curly brace: it sends the code inside the braces to the background, which simulates "multithreading" but is really multiple processes. wait pauses the script until all the background tasks have finished before continuing.
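The brace-group-plus-& pattern can be seen in isolation with a minimal sketch (using sleep as a stand-in for a batch of cutycapt calls): three 2-second "batches" run concurrently, so the whole loop finishes in about 2 seconds rather than 6.

```shell
#!/bin/bash
# Minimal sketch of the pattern above: each { ...; } & runs in a background
# subshell, and `wait` blocks until all of them have exited.
start=$(date +%s)
for i in 1 2 3; do
  {
    sleep 2                       # stand-in for one batch of cutycapt calls
    echo "batch $i finished"
  } &
done
wait                              # do not continue until every batch is done
elapsed=$(( $(date +%s) - start ))
echo "total: ${elapsed}s"
```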
Using this script greatly improved the screenshot speed: all the sites were captured in about two days, a significant improvement. Note that CutyCapt consumes considerable network bandwidth and CPU while capturing, so on a poorly specified machine do not start too many CutyCapt "threads", or you may hang the machine.
Python multithreaded web page screenshots
I have only recently started learning Python, and Python supports multithreading easily. After reading up on it, I used threading plus Queue to implement a properly multithreaded version of the screenshots:
#coding: utf-8
import threading,urllib2
import datetime,time
import Queue
import os

class Webshot(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            # If the queue is empty, exit; otherwise take a URL from it and screenshot
            if self.queue.empty():
                break
            host = self.queue.get().strip('\n')
            shotcmd = "DISPLAY=:0 cutycapt --url=http://" + host + " --max-wait=90000 --out=" + host + ".jpg"
            os.system(shotcmd)
            self.queue.task_done()
            time.sleep(1)

def main():
    queue = Queue.Queue()
    f = open('domain.txt', 'r')
    # Populate the queue with the URL list
    while True:
        line = f.readline()
        if len(line) == 0:
            break
        queue.put(line)
    # Create a pool of 10 concurrent threads, all consuming from the same queue
    for i in range(0, 10):
        shot = Webshot(queue)
        shot.start()

if __name__ == "__main__":
    main()
The procedure is as follows:
1. Create a Queue.Queue() instance and put the list of sites from domain.txt into the queue
2. A for loop spawns 10 concurrent threads
3. The queue instance is passed to the thread class Webshot, which is created by inheriting from threading.Thread
4. Each thread takes one item out of the queue and performs the corresponding work in its run method
5. When the work is done, it calls queue.task_done() to signal the queue that the task is complete
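The steps above can be sketched as a minimal Python 3 port of the same pattern (the Queue module was renamed queue in Python 3). This is a sketch, not the original script: the cutycapt call is replaced by collecting hostnames into a list so the flow can be verified without cutycapt installed, and the hostnames are hypothetical.

```python
import queue      # Python 2's "Queue" module became "queue" in Python 3
import threading

class Webshot(threading.Thread):
    """Worker thread: drains hostnames from a shared queue."""
    def __init__(self, q, results):
        threading.Thread.__init__(self)
        self.q = q
        self.results = results

    def run(self):
        while True:
            try:
                # Step 4: take one hostname out of the queue
                host = self.q.get_nowait()
            except queue.Empty:
                break                    # queue drained: this worker exits
            # Stand-in for the real work, i.e. os.system("DISPLAY=:0 cutycapt ...")
            self.results.append(host)
            # Step 5: tell the queue this item has been processed
            self.q.task_done()

def main():
    q = queue.Queue()
    results = []
    # Step 1: fill the queue (a stand-in for reading domain.txt)
    for host in ["example.com", "example.org", "example.net"]:
        q.put(host)
    # Steps 2-3: start 10 workers, all sharing the same queue instance
    workers = [Webshot(q, results) for _ in range(10)]
    for w in workers:
        w.start()
    q.join()     # block until task_done() has been called for every item
    return results

if __name__ == "__main__":
    print(sorted(main()))
```

One deliberate change from the original: using get_nowait() and catching queue.Empty avoids the race between empty() and get() when several workers drain the queue at the same time.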