Previously, I wrote a crawler in Scala and in Go. Having recently been reading about Perl, I decided to write a version with the same functionality. It uses the LWP::Simple module, which has to be installed from CPAN (unfortunately, Ubuntu 13.04 does not ship this important module with its Perl). The code is as follows:
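Installing the module is a one-liner; a sketch of the two usual routes on Ubuntu (the Debian/Ubuntu package name libwww-perl is an assumption based on the standard packaging of LWP, not something stated in the original):

```shell
# Install LWP::Simple from CPAN (the cpan client prompts for configuration on first run):
cpan LWP::Simple

# Or, on Debian/Ubuntu, install the distribution package that bundles LWP:
sudo apt-get install libwww-perl
```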
#!/usr/bin/perl
use LWP::Simple qw/get/;

my %pages;
print "Processing the index.\n";
$_ = get("http://www.yifan100.com/dir/15136/");
while (m#<a target="_blank" href="/article/(.*?)\.html" title="(.*?)" >#g) {
    $pages{$1} = $2;
}
for (keys %pages) {
    my ($l, $f) = ("http://www.yifan100.com/article/$_.html", "$_.txt");
    open F, ">$f";
    print "Processing $l.\n";
    if (get($l) =~ m#<div class="artcontent">(.*)<div id="zhanwei">#s) {
        $_ = $1;
        s#<br>#\n#g;
        s#<.*?>##gs;
        s#^\s+##g;
        print "Writing to $f.\n";
        print F;
    }
    close F;
}
The code is clearly single-threaded (and single-process), so the total execution time is still considerable; presumably most of it is spent waiting on the HTTP downloads. I no longer have the timings for the other versions, but this one measured as follows:
real    3m58.753s
user    0m0.900s
sys     0m0.632s
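Since the wall-clock time is dominated by downloads, the fetch loop could be spread across a few processes with fork. A minimal sketch, not the original code: it assumes the same URL layout, uses a hypothetical @ids list standing in for the ids scraped from the index, and leaves the per-page extraction as in the single-process version.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw/get/;

my @ids = (15136, 15137, 15138);   # placeholder: article ids collected from the index page
my $max_workers = 4;

# Split the ids round-robin into one chunk per worker.
my @chunks;
push @{ $chunks[$_ % $max_workers] }, $ids[$_] for 0 .. $#ids;

for my $chunk (@chunks) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        # Child: fetch its share of the pages, then exit.
        for my $id (@$chunk) {
            my $html = get("http://www.yifan100.com/article/$id.html");
            # ... same artcontent extraction and tag cleanup as above ...
        }
        exit 0;
    }
}
1 while waitpid(-1, 0) > 0;        # Parent: wait for all children to finish
```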
Clearly, the Perl version needs far less code than the Scala and Go versions did; text processing is where Perl shines.