Linux multithreaded web page screenshots with shell and Python


Shell multithreaded web page screenshots

There are two screenshot tools for Linux, CutyCapt and PhantomJS. After testing, CutyCapt is slower but relatively stable, while PhantomJS is faster but sometimes hangs in a half-dead state. Weighing the pros and cons, I decided to take the screenshots with CutyCapt plus a shell script:

webshot.sh

#!/bin/bash
#webshot
#by Caishzh 2013

WEBSHOTDIR="/data/webshot"
mkdir -p $WEBSHOTDIR

while read LINE
do
    DISPLAY=:0 cutycapt --url=http://$LINE --max-wait=90000 --out=$WEBSHOTDIR/$LINE.jpg >/dev/null 2>&1
done < domain.txt

The script is very simple and needs little explanation; domain.txt is the list of URLs. For CutyCapt installation and usage, refer to the linked guide.
Running the script, the screenshots come out fine and the image quality is high. But another problem appears: with tens of thousands of sites to capture, the run takes far too long, an estimated half a month or so.
That is too long to be acceptable, so the script needs optimizing. After looking into it, I decided on a multithreaded approach. Strictly speaking, the shell cannot implement real threads; it can only run multiple processes in the background.

multiwebshot.sh

#!/bin/bash
#Multithreading Webshot
#by Caishzh 2013

WEBSHOTDIR="/data/webshot"
mkdir -p $WEBSHOTDIR

# Split domain.txt into roughly 10 files (names beginning with x), 5000 lines per file
split -l 5000 domain.txt

for i in `ls x*`; do
{
    for j in `cat $i`; do
        DISPLAY=:0 cutycapt --url=http://$j --max-wait=90000 --out=$WEBSHOTDIR/$j.jpg >/dev/null 2>&1
    done
} &
done
wait

# Delete the temporary files created by split
rm -f x*

Script description:
domain.txt is split into multiple files of 5000 lines each, and two nested for loops capture the screenshots in multiple processes. The first for loop iterates over the file names produced by split; the second captures the sites listed in each of those files. Note the & after the closing curly brace: it sends the code inside the braces to the background, which simulates the effect of "multithreading" but is really multiple processes. wait pauses the script until all of the background jobs have finished before continuing.
This script greatly improves the screenshot speed; all of the sites were captured in about two days, a significant improvement. Note that CutyCapt uses considerable network bandwidth and CPU, so on a poorly configured machine do not start too many CutyCapt "threads", or the machine may grind to a halt.
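For comparison only, the same idea of capping how many captures run at once can be expressed in Python with a fixed-size process pool. This is a rough sketch, not part of the original workflow; it assumes cutycapt is installed, an X display is available at :0, that /data/webshot exists, and the pool size of 10 is just an example:

# Rough Python analogue of the chunk-and-background idea: a fixed-size
# process pool caps how many cutycapt instances run concurrently.
from multiprocessing import Pool
import subprocess

def shot(host):
    # One capture per worker process; output is discarded as in the shell script
    subprocess.call(
        "DISPLAY=:0 cutycapt --url=http://%s --max-wait=90000 --out=/data/webshot/%s.jpg" % (host, host),
        shell=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)

if __name__ == "__main__":
    with open("domain.txt") as f:
        hosts = [line.strip() for line in f if line.strip()]
    with Pool(processes=10) as pool:   # tune to the machine's CPU and bandwidth
        pool.map(shot, hosts)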

Python multithreaded web page screenshots

I have just started learning Python recently, and Python makes multithreading easy. After gathering some material, I used threading plus a Queue to put together a rough-and-ready multithreaded screenshot script:

#coding: utf-8
import threading, urllib2
import datetime, time
import Queue
import os

class Webshot(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            # If the queue is empty, exit; otherwise take a URL from the queue and capture it
            if self.queue.empty():
                break
            host = self.queue.get().strip('\n')
            shotcmd = "DISPLAY=:0 cutycapt --url=http://" + host + " --max-wait=90000 --out=" + host + ".jpg"
            os.system(shotcmd)
            self.queue.task_done()
            time.sleep(1)

def main():
    queue = Queue.Queue()
    f = file('domain.txt', 'r')

    # Populate the queue with the URL list
    while True:
        line = f.readline()
        if len(line) == 0:
            break
        queue.put(line)

    # Create a pool of threads and pass the queue to the thread class; 10 threads run concurrently
    for i in range(0, 10):
        shot = Webshot(queue)
        shot.start()

if __name__ == "__main__":
    main()

The program works as follows:

1. Create a Queue.Queue() instance and put the list of sites from domain.txt into the queue.
2. A for loop spawns 10 concurrent threads.
3. The queue instance is passed to the thread class Webshot, which is created by inheriting from threading.Thread.
4. Each thread takes an item from the queue and uses that data in its run method to do the actual work.
5. When the work is finished, the thread calls queue.task_done() to signal the queue that the task is complete.
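
The script above targets Python 2 (the Queue module and file() are gone in Python 3). For reference, here is a minimal Python 3 sketch of the same threading-plus-queue pattern; it assumes cutycapt is installed and an X display is available at :0, and the worker count of 10 is only illustrative:

# Minimal Python 3 sketch of the threading + queue pattern (assumes cutycapt and DISPLAY=:0).
import os
import queue
import threading

class Webshot(threading.Thread):
    def __init__(self, q):
        threading.Thread.__init__(self)
        self.queue = q

    def run(self):
        while True:
            try:
                # Exit once the queue has been drained
                host = self.queue.get_nowait().strip()
            except queue.Empty:
                break
            os.system("DISPLAY=:0 cutycapt --url=http://%s --max-wait=90000 --out=%s.jpg"
                      % (host, host))
            self.queue.task_done()

def main():
    q = queue.Queue()
    with open("domain.txt") as f:
        for line in f:
            if line.strip():
                q.put(line)

    for _ in range(10):          # 10 concurrent capture threads
        Webshot(q).start()
    q.join()                     # block until every queued URL has been processed

if __name__ == "__main__":
    main()

Using get_nowait() with the queue.Empty exception avoids the small race between empty() and get() in the original, and q.join() only returns after every item has been matched by a task_done() call.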
