Simple implementation of Perl crawlers


A project at work needed to crawl content from a third-party site, so I wrote a simple crawler in Perl under Linux.

Related tools

1. HttpWatch/browser developer tools

Usually this kind of tool is not needed, but sometimes the content you want to crawl cannot be found in the page's HTML source, because some pages request data asynchronously via Ajax. In that case a tool like HttpWatch is needed to find the actual HTTP request URL. Of course, many browsers now ship with developer tools (Chrome, Firefox, etc.) that make it easy to see all the requested URLs.

2. curl/wget

These are the most important tools in the crawler. Their role is to simulate the browser's HTTP requests in order to fetch data. Typically you request a URL to get the corresponding page's HTML source, but you can also download files. With curl and wget, both tasks are easy to accomplish.
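A quick sketch of both uses. A local file stands in for a real page so the example is self-contained; against a real site the URL would be `http://...` instead of `file://`:

```shell
# A local file stands in for a real page so the example runs offline.
printf '<html><a href="detail_1.html">app one</a></html>' > /tmp/page.html

# curl prints the response body to stdout (-s silences the progress meter).
curl -s "file:///tmp/page.html"

# -o saves the body to a file instead; wget -q -O file URL is the
# equivalent for http/https URLs (wget does not handle file://).
curl -s -o /tmp/copy.html "file:///tmp/page.html"

# Both tools exit nonzero on failure, which a crawler should always check.
curl -s -o /dev/null "file:///tmp/page.html" && echo "fetch ok"
```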

3. Perl

Once a page has been fetched, the required information must be extracted from its HTML, which is where regular expressions come in. I use Perl to write the crawler script; I did not use plain shell because the shell's regular-expression matching is too weak. Of course, many other scripting languages have very powerful regex support, Python for example, but if you are unfamiliar with all of them, Perl is easy to get started with.

4. Regular expressions

Regular-expression syntax is mostly common across languages, with small differences between them. Here are the important pieces of Perl's regex syntax:

Metacharacters:

Position anchors: ^ $ \b

Character classes: \d \w \s

Quantifiers: ? * + {m,n}

Grouping: ( ), e.g. (abc)*

Alternation: |, e.g. (ab|bc)

Capture variables: $1, $2, ... (filled by capturing groups)
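A short Perl example exercising each construct above (the input string is made up for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = "app_42.html size: 1024 KB";

# Anchors: ^ pins the match to the start of the string.
print "anchored\n" if $line =~ /^app_/;

# Character classes and quantifiers: \d+ matches one or more digits.
# Parentheses capture, and the first capture lands in $1.
if ($line =~ /app_(\d+)\.html/) {
    print "id=$1\n";               # prints id=42
}

# Alternation with grouping: (KB|MB) matches either unit.
if ($line =~ /(\d+)\s+(KB|MB)/) {
    print "size=$1 unit=$2\n";     # prints size=1024 unit=KB
}
```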


Crawl steps

The following uses crawling a website's mobile apps as an example to explain the steps.

1. Crawling directory pages

Suppose I need to crawl a set of directory pages. First, find the relationship between the page index and the URL. Here it is simple: the URL ends in [i]_new.html, so just replace [i] with the page index. Next, you need to know how many pages there are, so you know when the directory crawl is finished. Many sites display the total page count, but the pages crawled here do not. What then? You can check the page count manually; another way is to keep crawling until a page yields no matching catalog items, which means all the directory pages have been crawled.
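The paging loop above can be sketched as follows. For illustration the "pages" are local files; in a real crawl FetchPage would shell out to curl with the directory URL pattern, the page index substituted for [i]:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# FetchPage reads local files here so the example runs offline; the real
# version would be: return `curl -s "http://site/apps_${page}_new.html"`;
sub FetchPage {
    my $page = $_[0];
    my $file = "/tmp/list_${page}_new.html";
    return -f $file ? `cat "$file"` : "";
}

# Fake a two-page directory.
system(qq{printf '<a href="detail_1.html">a</a>' > /tmp/list_1_new.html});
system(qq{printf '<a href="detail_2.html">b</a>' > /tmp/list_2_new.html});

my @all_urls;
for (my $page = 1; ; $page++) {
    my @urls = (FetchPage($page) =~ /<a href="(detail_\d+\.html)"/g);
    last if @urls == 0;     # no catalog items: past the last page, stop
    push @all_urls, @urls;
}
print "$_\n" for @all_urls;
```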

After a directory page is fetched, extract the second-level page URLs with regular expressions and write them to the database. A URL uniquely identifies a page, so make sure the URLs written are not duplicated. Note that a URL in the HTML may be a relative path and needs to be completed into an absolute URL.
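A minimal sketch of the URL completion step, covering the two common cases (site-root paths and paths relative to the current page); a real crawler may need to handle more:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Turn a possibly-relative href into an absolute URL, given the URL of
# the page it was found on.
sub CompleteUrl {
    my ($url, $base) = @_;
    return $url if $url =~ m{^https?://};            # already absolute
    my ($scheme_host) = ($base =~ m{^(https?://[^/]+)});
    return $scheme_host . $url if $url =~ m{^/};     # root-relative path
    (my $dir = $base) =~ s{/[^/]*$}{/};              # strip the page name
    return $dir . $url;                              # page-relative path
}

print CompleteUrl("/app/1.html", "http://example.com/list_1.html"), "\n";
# http://example.com/app/1.html
print CompleteUrl("detail_2.html", "http://example.com/apps/list.html"), "\n";
# http://example.com/apps/detail_2.html
```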

In most cases incremental crawls are required, for example fetching only each day's new catalog items. To avoid repeated, wasted crawling, it is best if the directory pages are sorted by update time, so that only the first few pages need to be crawled. How do you know which directory pages have been updated? If each catalog item carries an update time, you can compare against it. An even simpler way: if every URL on a page already exists in the database, the page has no new catalog items, and the crawl can stop.
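The early-stop rule can be sketched like this; a hash stands in for the database of already-seen URLs:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# URLs already in the database (simulated with a hash here).
my %seen = map { $_ => 1 } ("detail_1.html", "detail_2.html");

# If every URL on a directory page is already known, and pages are sorted
# by update time, nothing newer can follow, so the crawl can stop.
sub PageIsAllKnown {
    my @page_urls = @_;
    for my $u (@page_urls) {
        return 0 if !$seen{$u};   # at least one new item: keep crawling
    }
    return 1;                     # nothing new on this page: stop
}

print PageIsAllKnown("detail_1.html", "detail_2.html") ? "stop\n" : "continue\n";
print PageIsAllKnown("detail_2.html", "detail_3.html") ? "stop\n" : "continue\n";
```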

2. Crawling detail information

The first step collected the URLs of the second-level pages; next is crawling the details, such as the various pieces of information about each mobile app and the URL of its installation package. Text information is easy to extract from the HTML, but the package URL cannot be found at a glance: here it is hidden in JavaScript, and an ID extracted from the page must be spliced together into the package URL. For URLs whose details have been crawled, the database should use a status field to mark them as done, to avoid repeated crawls.
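A sketch of the ID-splicing trick. Both the inline JavaScript fragment and the URL template below are made up for illustration; the real variable name and template depend on the target site:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The package URL is not in the HTML directly; an id buried in inline
# JavaScript has to be extracted and spliced into a URL template.
my $html = q{<script>var appId = "10086"; showDownload(appId);</script>};

if ($html =~ /var\s+appId\s*=\s*"(\d+)"/) {
    my $pkg_url = "http://example.com/down/$1.apk";   # hypothetical template
    print "$pkg_url\n";   # http://example.com/down/10086.apk
}
```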

3. File download

Sometimes we crawl not only text information but also need to download images or files; here, the installation packages. The previous step already collected the package URLs, so the files are easily downloaded with curl or wget. A status field is again needed to record each file's download state.
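A download sketch: fetch the package and record the outcome in a status flag (in the real crawler this would be an UPDATE on the URL's database row). A file:// copy via curl is used so the example runs without network; in practice the URL would be the real package URL, and `wget -q -O` works equally well for http:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for a remote package so the example is self-contained.
system(qq{printf 'fake-apk-bytes' > /tmp/pkg-src});
my $url  = "file:///tmp/pkg-src";
my $dest = "/tmp/pkg.apk";

# Download and derive the status from curl's exit code and file size.
my $rc = system("curl", "-s", "-o", $dest, $url);
my $status = ($rc == 0 && -s $dest) ? "downloaded" : "failed";
print "$status\n";
```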

Generalization and extension

1. Common Crawl Interface

To reduce duplicated crawling code, the common parts are factored out here into a more general crawl interface. Note that the page encoding and the database encoding may differ, so the page must be converted to the database encoding before writing, or the stored data may come out garbled. The interface description and code are as follows:

Call: @results = &CrawlUrl($url, $page_charset, $expect_charset, \@regexs, \$crawl_result)

Parameters: URL, page encoding, expected encoding, array of regular expressions, reference to the crawl result (0 on success, nonzero on failure)

Return value: a two-dimensional array of match results (each regular expression yields one array of matched data)

#!/usr/bin/perl
use strict;
use warnings;

# Escape [ and ] so curl does not treat them as URL globbing patterns.
sub ParseUrl
{
    my $url=$_[0];
    $url=~s/\[/\\\[/g;
    $url=~s/\]/\\\]/g;
    return $url;
}

sub CrawlUrl
{
    my $url=$_[0];
    my $page_charset=$_[1];
    my $expect_charset=$_[2];
    my $regex_ref=$_[3];
    my $crawl_result_ref=$_[4];
    my @regexs=@$regex_ref;
    my @results;

    # Name the temporary file after the URL's MD5 to avoid collisions.
    my $file=`echo -n "$url" | md5sum | awk '{print \$1".htm"}'`;
    chomp($file);
    $url=&ParseUrl($url);
    `curl -s -o "$file" "$url"`;
    my $curl_result=$?;
    if($curl_result!=0)
    {
        $$crawl_result_ref=1;
        return @results;
    }

    # Convert the page to the expected encoding if they differ.
    my $html="";
    if($page_charset ne "" && $expect_charset ne "" && $page_charset ne $expect_charset)
    {
        $html=`iconv -f $page_charset -t $expect_charset "$file"`;
    }
    else
    {
        $html=`cat "$file"`;
    }
    `rm -f "$file"`;

    # Apply each regex globally; each one yields one row of the result.
    for(my $i=0;$i<=$#regexs;$i++)
    {
        my $reg=$regexs[$i];
        my @matches=($html=~/$reg/sg);
        $results[$i]=\@matches;
    }

    $$crawl_result_ref=0;
    return @results;
}

2. Crawler generality

We may need to crawl multiple sites of the same type; for example, I need to crawl mobile apps from dozens of sources. Writing a dedicated crawler for each site would mean a lot of coding work, so the crawler's generality has to be considered: how to make one set of code fit a whole class of websites. The approach here is to store each site's distinguishing information as configuration in the database, such as the directory-page URL, the site's encoding, the regular expressions, and so on. The crawler reads these configurations to adapt itself to different sites, achieving a degree of generality. Adding a crawl for a new website then only requires adding the corresponding configuration, without modifying any code.
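A sketch of what such per-site configuration might look like. In the real system these rows live in a database table; the field names and values here are assumptions for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One entry per site; the crawler only reads this, so adding a site adds
# a row of configuration, not code.
my %site_config = (
    "site_a" => {
        dir_url_pattern => "http://site-a.example/apps_[i]_new.html",
        page_charset    => "gbk",
        item_regex      => '<a href="([^"]+)" class="app">',
    },
    "site_b" => {
        dir_url_pattern => "http://site-b.example/list?page=[i]",
        page_charset    => "utf-8",
        item_regex      => '<div class="item"><a href="([^"]+)">',
    },
);

for my $site (sort keys %site_config) {
    my $cfg = $site_config{$site};
    (my $first_page = $cfg->{dir_url_pattern}) =~ s/\[i\]/1/;
    print "$site: $first_page ($cfg->{page_charset})\n";
}
```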

3. Multi-Process crawling

If there are many pages to crawl or files to download and they are time-consuming, consider crawling with multiple processes in parallel. Write a process-control module that queries the database for uncrawled URLs, checks how many crawl processes are currently running, and decides whether to start new ones.
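A minimal sketch of the process-control idea: cap the number of concurrent crawler children. Real code would pull uncrawled URLs from the database instead of this fixed list, and the child would exec curl or the crawl script:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX ":sys_wait_h";

my @urls      = map { "url_$_" } 1 .. 6;   # stand-in for uncrawled URLs
my $max_procs = 3;
my %children;

while (@urls || %children) {
    # Reap any children that have finished.
    while ((my $pid = waitpid(-1, WNOHANG)) > 0) {
        delete $children{$pid};
    }
    if (@urls && keys(%children) < $max_procs) {
        my $url = shift @urls;
        my $pid = fork();
        if ($pid == 0) {
            # Child: crawl one URL (stubbed out here).
            exit 0;
        }
        $children{$pid} = 1;
    } else {
        select(undef, undef, undef, 0.05);   # brief sleep before re-polling
    }
}
print "all crawls finished\n";
```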

4. Proxies

Some websites limit the access frequency per IP; if you crawl such a site too often, your IP may get blocked. This can be avoided by switching randomly among several proxy servers. To avoid duplicated code, wrap wget and the proxy handling in a small shell tool.

#!/bin/bash

PROXY_HOST=(...)   # fill in the list of proxy servers

function GetProxyStr()
{
    # Pick a random proxy; with probability 1/(N+1), use no proxy at all.
    rand=$(($RANDOM % (${#PROXY_HOST[*]} + 1)))
    if [ $rand -lt ${#PROXY_HOST[*]} ]
    then
        PROXY_STR="-e http_proxy=${PROXY_HOST[$rand]}"
    fi
}

PROXY_STR=""
PATH_TYPE="$1"
FILE_PATH="$2"
URL="$3"

GetProxyStr
GetPath    # definition omitted in the original

wget --user-agent="Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.3) Gecko/2008092416 Firefox/3.0.3" $PROXY_STR $PATH_TYPE "$FILE_PATH" "$URL"

5. Monitoring

One more problem: if the crawler runs on a daily schedule, and a site's directory-page URL changes or the site is redesigned, the crawl will fail. These failures need to be monitored: when fetching a page fails, or the regular expressions stop matching, raise an alarm by SMS, mail, or other means.
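A sketch of that check, treating "fetch failed" and "fetched but nothing matched" (a likely page redesign) as separate alerts. The Alert sub just records and prints here; a real one would call the SMS or mail gateway:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @alerts;

sub Alert {
    my ($site, $reason) = @_;
    push @alerts, "[$site] $reason";
    print "ALERT $alerts[-1]\n";   # real version: invoke SMS/mail gateway
}

sub CheckCrawl {
    my ($site, $fetch_ok, $match_count) = @_;
    if (!$fetch_ok) {
        Alert($site, "page fetch failed - directory URL may have changed");
    } elsif ($match_count == 0) {
        Alert($site, "no items matched - page layout may have changed");
    }
}

CheckCrawl("site_a", 0, 0);    # fetch failure -> alert
CheckCrawl("site_b", 1, 0);    # regex failure -> alert
CheckCrawl("site_c", 1, 25);   # healthy, no alert
```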
