PHP and Python implementations of a thread pool multithreaded crawler (with examples)

Source: Internet
Author: User
Tags: PHP class, PHP example, PHP programming, PHP regular expression

This article describes PHP and Python implementations of a thread pool multithreaded crawler. It is shared for your reference; the details are as follows.

A multithreaded crawler can fetch content in parallel, which improves crawling performance. Below are thread pool based multithreaded crawler examples in PHP and Python; the code is as follows:

PHP Example

<?php
class Connect extends Worker // worker mode
{
    /**
     * The cURL handle is stored statically, which for pthreads
     * means thread-local.
     */
    protected static $ch;

    public function __construct() {}

    public function getConnection()
    {
        if (!self::$ch) {
            self::$ch = curl_init();
            curl_setopt(self::$ch, CURLOPT_TIMEOUT, 2);
            curl_setopt(self::$ch, CURLOPT_RETURNTRANSFER, 1);
            curl_setopt(self::$ch, CURLOPT_HEADER, 0);
            curl_setopt(self::$ch, CURLOPT_NOSIGNAL, true);
            curl_setopt(self::$ch, CURLOPT_USERAGENT, "Firefox");
            curl_setopt(self::$ch, CURLOPT_FOLLOWLOCATION, 1);
        }
        /* do some exception/error handling here maybe */
        return self::$ch;
    }

    public function closeConnection()
    {
        curl_close(self::$ch);
    }
}

class Query extends Threaded
{
    protected $url;
    protected $result;

    public function __construct($url)
    {
        $this->url = $url;
    }

    public function run()
    {
        $ch = $this->worker->getConnection();
        curl_setopt($ch, CURLOPT_URL, $this->url);
        $page  = curl_exec($ch);
        $info  = curl_getinfo($ch);
        $error = curl_error($ch);
        $this->deal_data($this->url, $page, $info, $error);
        $this->result = $page;
    }

    public function deal_data($url, $page, $info, $error)
    {
        $parts = explode(".", $url);
        $id = $parts[1];
        if ($info['http_code'] != 200) {
            $this->show_msg($id, $error);
        } else {
            $this->show_msg($id, "OK");
        }
    }

    public function show_msg($id, $msg)
    {
        echo $id . "\t$msg\n";
    }

    public function getResult()
    {
        return $this->result;
    }
}

function check_urls_multi_pthreads()
{
    global $check_urls; // the URLs to crawl
    $check_urls = array('http://xxx.com' => "xx net");
    $pool = new Pool(10, "Connect", array()); // create a pool of 10 worker threads
    foreach ($check_urls as $url => $name) {
        $pool->submit(new Query($url));
    }
    $pool->shutdown();
}

check_urls_multi_pthreads();

Python multithreading example

from threading import Thread

def handle(sid):
    # crawl and process the data for this id
    pass

class MyThread(Thread):
    """Simple worker: one thread per id."""
    def __init__(self, sid):
        Thread.__init__(self)
        self.sid = sid

    def run(self):
        handle(self.sid)

threads = []
for i in range(1, 11):
    t = MyThread(i)
    threads.append(t)
    t.start()

for t in threads:
    t.join()
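Note that the PHP example relies on the pthreads extension (Worker, Threaded and Pool are pthreads classes), which requires a thread-safe (ZTS) build of PHP and is meant for command line scripts. On the Python side, the same "pool of 10 workers" pattern can also be written with the standard library's concurrent.futures module. The sketch below is only a minimal illustration of that idea, reusing the handle(sid) placeholder from the example above; it is not part of the original article's code.

from concurrent.futures import ThreadPoolExecutor

def handle(sid):
    # placeholder: fetch and process the page identified by sid
    print('handled', sid)

# A pool of 10 worker threads, mirroring the 10 threads started above.
with ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(handle, sid) for sid in range(1, 11)]
    for future in futures:
        future.result()  # re-raises any exception raised inside a worker

Leaving the with block waits for all submitted tasks to finish, which plays the same role as the join() loop in the example above.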

 

Python thread pool crawler:

from queue import Queue
from threading import Thread, Lock
import urllib.parse
import socket
import re
import time

seen_urls = set(['/'])
lock = Lock()

class Fetcher(Thread):
    def __init__(self, tasks):
        Thread.__init__(self)
        self.tasks = tasks
        self.daemon = True
        self.start()

    def run(self):
        while True:
            url = self.tasks.get()
            print(url)
            sock = socket.socket()
            sock.connect(('localhost', 3000))
            get = 'GET {} HTTP/1.0\r\nHost: localhost\r\n\r\n'.format(url)
            sock.send(get.encode('ascii'))
            response = b''
            chunk = sock.recv(4096)
            while chunk:
                response += chunk
                chunk = sock.recv(4096)
            links = self.parse_links(url, response)
            lock.acquire()
            for link in links.difference(seen_urls):
                self.tasks.put(link)
            seen_urls.update(links)
            lock.release()
            self.tasks.task_done()

    def parse_links(self, fetched_url, response):
        if not response:
            print('error: {}'.format(fetched_url))
            return set()
        if not self._is_html(response):
            return set()
        urls = set(re.findall(r'''(?i)href=["']?([^\s"'<>]+)''',
                              self.body(response)))
        links = set()
        for url in urls:
            normalized = urllib.parse.urljoin(fetched_url, url)
            parts = urllib.parse.urlparse(normalized)
            if parts.scheme not in ('', 'http', 'https'):
                continue
            host, port = urllib.parse.splitport(parts.netloc)
            if host and host.lower() not in ('localhost',):
                continue
            defragmented, frag = urllib.parse.urldefrag(parts.path)
            links.add(defragmented)
        return links

    def body(self, response):
        body = response.split(b'\r\n\r\n', 1)[1]
        return body.decode('utf-8')

    def _is_html(self, response):
        head, body = response.split(b'\r\n\r\n', 1)
        headers = dict(h.split(': ') for h in head.decode().split('\r\n')[1:])
        return headers.get('Content-Type', '').startswith('text/html')

class ThreadPool:
    def __init__(self, num_threads):
        self.tasks = Queue()
        for _ in range(num_threads):
            Fetcher(self.tasks)

    def add_task(self, url):
        self.tasks.put(url)

    def wait_completion(self):
        self.tasks.join()

if __name__ == '__main__':
    start = time.time()
    pool = ThreadPool(4)
    pool.add_task('/')
    pool.wait_completion()
    print('{} URLs fetched in {:.1f} seconds'.format(
        len(seen_urls), time.time() - start))
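A note on the design of the thread pool crawler: the Fetcher instances are daemon threads that block on tasks.get(), the shared seen_urls set is guarded by a Lock, and wait_completion() relies on Queue.join() together with task_done() to detect when every discovered URL has been processed. The example assumes an HTTP server is already listening on localhost:3000. For a quick local test you could first serve a directory of HTML pages on that port with the standard library's http.server module; the snippet below is one hypothetical way to do that and is not part of the original article.

import http.server
import socketserver
import threading

# Serve the current working directory on localhost:3000 in the background
# so the crawler above has something to fetch.
server = socketserver.TCPServer(('localhost', 3000),
                                http.server.SimpleHTTPRequestHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

Alternatively, running python3 -m http.server 3000 from a shell in that directory does the same thing.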

Readers interested in more PHP content can check the site's topics: "Summary of PHP curl Usage", "Complete Guide to PHP Array Operations", "Summary of PHP Sorting Algorithms", "Summary of Common PHP Traversal Algorithms and Techniques", "PHP Data Structures and Algorithms Tutorial", "Summary of PHP Programming Algorithms", "Summary of PHP Mathematical Calculation Techniques", "Summary of PHP Regular Expression Usage", "Summary of PHP Operations and Operator Usage", "Summary of PHP String Usage", and "Summary of Common PHP Database Operation Techniques".

I hope this article is helpful to readers working on PHP programming.
