A good Perl-LWP document

http://hi.baidu.com/0soul/blog/item/91098701f5051880e850cd4b.html

Original article: http://www.perl.com/pub/a/2002/08/20/perlandlwp.html
The copyright information is lengthy, so I have put it at the end.
I think this Perl LWP document is quite good, so I am reposting it here for easy reference.
The reposted content follows:

********************************************************************************

Introduction to LWP and the Web

LWP (short for "Library for WWW in Perl") is a group of Perl modules for fetching data from the web. Like most Perl modules, each LWP module comes with its own detailed documentation that serves as a complete reference for that module. With so many modules in LWP, however, newcomers often do not know where to start, even for the simplest tasks. A full introduction to LWP would take a whole book, and fortunately Perl & LWP has been published. This article introduces the most common uses of LWP.

Use LWP::Simple to get a web page

If you only want to fetch a web page, the functions in LWP::Simple are the easiest way. Call get($url) to fetch the content at that URL. If nothing goes wrong, get returns the content of the page; otherwise it returns undef. For example:

my $url = 'http://freshair.npr.org/dayFA.cfm?todayDate=current';

use LWP::Simple;
my $content = get $url;
die "Couldn't get $url" unless defined $content;

# $content now holds the page content, which we examine below:

if ($content =~ m/jazz/i) {
    print "They're talking about jazz today on Fresh Air!\n";
} else {
    print "Fresh Air is apparently jazzless today.\n";
}

If you are working from the command line, the getprint function is handy. If nothing goes wrong, it prints the page content to STDOUT; otherwise it prints an error message to STDERR. For example:

% perl -MLWP::Simple -e "getprint 'http://cpan.org/RECENT'"

That URL points to a text file listing the files that have changed on CPAN in the last two weeks. If you want to be e-mailed whenever a module in the Acme:: namespace is updated, you can combine the command with shell tools, like this:

% perl -MLWP::Simple -e "getprint 'http://cpan.org/RECENT'" \
| grep "/by-module/Acme" | mail -s "New Acme modules! Joy!" $USER

LWP::Simple has a few other very useful functions, including head, which issues a HEAD request to check whether a link is valid or whether a page has been updated. Two others, getstore and mirror, save a URL to a local file. For details, see the LWP::Simple documentation or chapter 2 of Perl & LWP.

LWP class model basics

LWP::Simple is convenient for simple jobs, but it does not support cookies, user authentication, editing HTTP request headers, or reading and writing HTTP response headers (mainly HTTP error messages). When you need these features, you have to use the full LWP class model. Of the many LWP classes, the two you must know are LWP::UserAgent and HTTP::Response. LWP::UserAgent acts like a virtual browser that you use to make requests. HTTP::Response holds the response that a request produces.

The most basic usage is $response = $browser->get($url), or written out more fully:

# Early in your program:

use LWP 5.64; # loads the LWP classes and checks for a reasonably recent version

my $browser = LWP::UserAgent->new;

...

# Later, whenever you need to make a GET request:
my $url = 'http://freshair.npr.org/dayFA.cfm?todayDate=current';

my $response = $browser->get($url);
die "Can't get $url -- ", $response->status_line
    unless $response->is_success;

die "Hey, I wanted HTML, not ", $response->content_type
    unless $response->content_type eq 'text/html';
    # or whatever other Content-Type you can handle

# If that succeeded, process the content:

if ($response->content =~ m/jazz/i) {
    print "Fresh Air is discussing jazz today!\n";
} else {
    print "Fresh Air is not touching on jazz at all today.\n";
}

Two objects are involved here: $browser, an object of class LWP::UserAgent, and $response, an object of class HTTP::Response. A program needs only one $browser object, but every request you make returns a new HTTP::Response object. An HTTP::Response object has these useful attributes:

  • A status code indicating success or failure, which you can test with $response->is_success.
  • An HTTP status line. The value of $response->status_line (such as "404 Not Found") describes the result in words.
  • A MIME content type, available from $response->content_type, such as text/html, image/gif, or application/xml.
  • The content of the response, in $response->content. For an HTML page this is the HTML source; for a GIF, $response->content holds the binary GIF data.
  • Many other methods, documented in HTTP::Response and its superclasses HTTP::Message and HTTP::Headers.
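
To try these accessors without touching the network, one option is to build an HTTP::Response by hand. The following is only a minimal sketch: the status, header, and body values are made up for illustration and do not come from any real server.

```perl
use strict;
use warnings;
use HTTP::Response;

# Build a response by hand (hypothetical values) so the accessors
# can be exercised without making a real request.
my $response = HTTP::Response->new(
    404, 'Not Found',
    [ 'Content-Type' => 'text/html; charset=iso-8859-1' ],
    '<html><body>No such page</body></html>',
);

print "Success?     ", ( $response->is_success ? 'yes' : 'no' ), "\n";
print "Status line: ", $response->status_line, "\n";
# In scalar context content_type returns only the media type,
# without parameters such as charset.
print "Type:        ", scalar( $response->content_type ), "\n";
print "Body length: ", length( $response->content ), "\n";
```

A response returned by $browser->get($url) can be inspected in exactly the same way.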

Add other HTTP request headers

The usual request syntax is $response = $browser->get($url). To add other HTTP headers to the request, follow $url with a list of key-value pairs, like this:

$response = $browser->get( $url, $key1, $value1, $key2, $value2, ... );

For example, to request a page from a site that only accepts connections from Netscape browsers, you can send headers like the ones Netscape sends:

my @ns_headers = (
    'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)',
    # the long value is joined with "." so that no newline ends up in the header
    'Accept' => 'image/gif, image/x-xbitmap, image/jpeg, '
              . 'image/pjpeg, image/png, */*',
    'Accept-Charset' => 'iso-8859-1,*,utf-8',
    'Accept-Language' => 'en-US',
);

...

$response = $browser->get($url, @ns_headers);

If you do not need to reuse the array, you can write the headers directly into the call to get:

$response = $browser->get($url,
    'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)',
    'Accept' => 'image/gif, image/x-xbitmap, image/jpeg, '
              . 'image/pjpeg, image/png, */*',
    'Accept-Charset' => 'iso-8859-1,*,utf-8',
    'Accept-Language' => 'en-US',
);

If you only want to change the User-Agent, you can use the agent method of LWP::UserAgent to replace the default agent string, 'libwww-perl/5.65' (or similar):

$browser->agent('Mozilla/4.76 [en] (Win98; U)');

Use cookies

By default, an LWP::UserAgent object behaves like a browser with no cookie support. There is more than one way to enable cookies, all of which set its cookie_jar attribute. A "cookie jar" is a container that stores HTTP cookies. You can keep it on disk (as Netscape does with its cookies.txt file) or in memory. Cookies held in memory are lost when the program exits. To use an in-memory cookie jar:

$browser->cookie_jar({});

You can also store cookies in files on the hard disk:

use HTTP::Cookies;
$browser->cookie_jar( HTTP::Cookies->new(
    'file' => '/some/where/cookies.lwp',
    # where to read and write the cookies
    'autosave' => 1,
    # save to disk automatically when done
));

The cookies in that file are stored in LWP's own format. If you want to use the cookies in your Netscape cookie file, use the HTTP::Cookies::Netscape class:

use HTTP::Cookies;
# yes, loads HTTP::Cookies::Netscape too

$browser->cookie_jar( HTTP::Cookies::Netscape->new(
'file' => 'c:/Program Files/Netscape/Users/DIR-NAME-HERE/cookies.txt',
# where to read cookies
));

You can also add 'autosave' => 1 as above. However, at least at the time of writing, it is uncertain whether Netscape may discard some of the cookies that you write back to disk.

Submit a form through POST

Most HTML forms submit their data to the server with an HTTP POST request, which you can send like this:

$response = $browser->post( $url,
[
formkey1 => value1,
formkey2 => value2,
...
],
);

You can also send HTTP headers along with the form data:

$response = $browser->post( $url,
[
formkey1 => value1,
formkey2 => value2,
...
],
headerkey1 => value1,
headerkey2 => value2,
);

The next example sends an HTTP POST request to the AltaVista search engine and extracts the total number of matches from the returned HTML.

use strict;
use warnings;
use LWP 5.64;
my $browser = LWP::UserAgent->new;

my $word = 'tarragon';

my $url = 'http://www.altavista.com/sites/search/web';
my $response = $browser->post( $url,
    [ 'q' => $word, # the AltaVista query string
      'pg' => 'q', 'avkw' => 'tgz', 'kl' => 'XX',
    ]
);
die "$url error: ", $response->status_line
    unless $response->is_success;
die "Weird content type at $url -- ", $response->content_type
    unless $response->content_type eq 'text/html';

if ( $response->content =~ m{AltaVista found ([0-9,]+) results} ) {
    # The matched text looks like "AltaVista found 2,345 results"
    print "$word: $1\n";
} else {
    print "Couldn't find the match-string in the response\n";
}

Submit a form through GET

Some HTML forms send their data with a GET request instead of POST. For example, if you search imdb.com for the movie title 'Blade Runner' and submit the form, the browser's URL bar shows:

http://us.imdb.com/Tsearch?title=Blade%20Runner&restrict=Movies+and+TV

Here is how to get the same result with LWP:

use URI;
my $url = URI->new( 'http://us.imdb.com/Tsearch' );
# makes an object representing the URL

$url->query_form( # And here the form data pairs:
'title' => 'Blade Runner',
'restrict' => 'Movies and TV',
);

my $response = $browser->get($url);

Chapter 5 of Perl & LWP discusses HTML forms and form data in detail, and chapters 6 through 9 describe how to extract useful information from the HTML you fetch.

URL Processing

The URI class mentioned above provides many methods for inspecting and modifying URLs. For example, to find out a URL's scheme (http, ftp, and so on), use $url->scheme. To extract the host name from a URL, use $url->host. The most useful methods, though, are probably query_form, which I mentioned earlier, and new_abs, which turns a relative URL (such as "../foo.html") into an absolute one (such as "http://www.perl.com/stuff/foo.html"). For example:

use URI;
$abs = URI->new_abs($maybe_relative, $base);
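
The URI methods described above can be tried entirely offline. The sketch below reuses the IMDb form values from the earlier GET example; the base URL passed to new_abs is an assumed example value, not something taken from a real page.

```perl
use strict;
use warnings;
use URI;

# Inspect the parts of a URL.
my $url    = URI->new('http://us.imdb.com/Tsearch');
my $scheme = $url->scheme;    # "http"
my $host   = $url->host;      # "us.imdb.com"

# Attach form data as a query string; query_form does the escaping.
$url->query_form(
    'title'    => 'Blade Runner',
    'restrict' => 'Movies and TV',
);

# Resolve a relative URL against a base URL (assumed example values).
my $abs = URI->new_abs( 'foo.html', 'http://www.perl.com/stuff/index.html' );

# URI objects turn into plain strings when interpolated.
print "$scheme\n$host\n$url\n$abs\n";
```

Note that query_form encodes the space in 'Blade Runner' as "+", which servers treat the same as "%20" in a query string.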

Now let's revisit the earlier example of fetching the list of recent CPAN uploads, this time from the HTML version of the page:

use strict;
use warnings;
use LWP 5.64;
my $browser = LWP::UserAgent->new;

my $url = 'http://www.cpan.org/RECENT.html';
my $response = $browser->get($url);
die "Can't get $url -- ", $response->status_line
unless $response->is_success;

my $html = $response->content;
while( $html =~ m/<A href=\"(.*?)\"/g ) {
print "$1\n";
}

The output is:

MIRRORING.FROM
RECENT
RECENT.html
authors/00whois.html
authors/01mailrc.txt.gz
authors/id/A/AA/AASSAD/CHECKSUMS
...

To get full URLs, use the new_abs method of the URI module and change the while loop to this:

while( $html =~ m/<A href=\"(.*?)\"/g ) {
print URI->new_abs( $1, $response->base ), "\n";
}

The $response->base method (documented in HTTP::Response) returns the URL that should be used to resolve relative URLs in the response; it is usually the same as the URL you requested. The output is now:

http://www.cpan.org/MIRRORING.FROM
http://www.cpan.org/RECENT
http://www.cpan.org/RECENT.html
http://www.cpan.org/authors/00whois.html
http://www.cpan.org/authors/01mailrc.txt.gz
http://www.cpan.org/authors/id/A/AA/AASSAD/CHECKSUMS
...

For more about URI objects, see chapter 4 of Perl & LWP. Using regular expressions to match URLs is simple, of course, but when things get more complex and you need more powerful tools, consider the HTML parsing modules HTML::LinkExtor or HTML::TokeParser, or even HTML::TreeBuilder.
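
As a rough illustration of the HTML::LinkExtor alternative, the sketch below parses a small, made-up HTML fragment instead of a fetched page, and resolves each link against an assumed base URL of http://www.cpan.org/.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# A small HTML fragment standing in for a fetched page (made-up content).
my $html = <<'END_HTML';
<html><body>
<A href="RECENT">Recent files</A>
<a href="authors/00whois.html">Authors</a>
<img src="/misc/logo.png">
</body></html>
END_HTML

# With no callback, the parser collects links internally; the second
# argument is the base URL used to make each link absolute.
my $parser = HTML::LinkExtor->new( undef, 'http://www.cpan.org/' );
$parser->parse($html);
$parser->eof;

my @urls;
for my $link ( $parser->links ) {
    my ( $tag, %attr ) = @$link;    # tag name, then attribute => URL pairs
    next unless $tag eq 'a' && defined $attr{href};
    push @urls, "$attr{href}";      # stringify the URL object
}

print "$_\n" for @urls;
```

Unlike the regular expression above, this handles both upper- and lower-case tags and ignores links that are not in anchor elements, such as the image here.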

Other browser attributes

LWP::UserAgent objects have several other attributes worth knowing about:

  • $browser->timeout(15): sets the default request timeout, in seconds. If the server does not respond within this time, the request is abandoned.
  • $browser->protocols_allowed( ['http', 'gopher'] ): restricts the browser to the HTTP and gopher protocols. A request for any other protocol returns an error response with a message like "Access to 'ftp' URIs has been disabled".
  • use LWP::ConnCache; $browser->conn_cache(LWP::ConnCache->new()): tells the browser object to use the HTTP/1.1 "keep-alive" feature, which speeds up requests by reusing the same socket.
  • $browser->agent('SomeName/1.23 (more info here maybe)'): sets the User-Agent string sent with HTTP requests. By default LWP identifies itself as "libwww-perl/versionnumber", for example "libwww-perl/5.65". You can supply more information:
    $browser->agent( 'SomeName/3.14 (contact@robotplexus.int)' );

    Or you can pretend to be another browser:

    $browser->agent( 'Mozilla/4.0 (compatible; MSIE 5.12; Mac_PowerPC)' );
  • push @{ $browser->requests_redirectable }, 'POST': tells LWP to follow redirects automatically after a POST request (even though the HTTP RFC says this should not normally be done).

For details, see the LWP::UserAgent documentation.

Write a polite robot

If you want your program to obey robots.txt and avoid sending too many requests in a short time, use LWP::RobotUA instead of LWP::UserAgent. You use LWP::RobotUA just as you use LWP::UserAgent:

use LWP::RobotUA;
my $browser = LWP::RobotUA->new(
    'YourSuperBot/100', 'you@yoursite.com');
    # your robot's name and your e-mail address

my $response = $browser->get($url);

LWP::RobotUA adds the following features:

  • If the robots.txt on the server that $url points to forbids access to $url, $browser does not send the request. Instead it returns a response with code 403 and the message "Forbidden by robots.txt". So if you have this line:
    die "$url -- ", $response->status_line, "\nAborted"
    unless $response->is_success;

    then you get an error message like this:

    http://whatever.site.int/pith/x.html -- 403 Forbidden
    by robots.txt
    Aborted at whateverprogram.pl line 1234
  • If $browser sees that you have recently sent a request to the same server, it pauses (sleeps) to avoid sending too many requests. The default delay is 1 minute, but you can change it with $browser->delay(minutes). For example:
    $browser->delay( 7/60 );

    This makes the browser wait at least 7 seconds between requests to the same server.

For details, see the LWP::RobotUA documentation.

Use a proxy

Sometimes you want (or need) to reach certain sites or protocols through a proxy, for example when your LWP program runs on a machine behind a firewall. The proxy address is usually stored in an environment variable such as HTTP_PROXY. LWP can load proxy settings from the environment through the env_proxy method of the user-agent object:

use LWP::UserAgent;
my $browser = LWP::UserAgent->new;

# And before you go making any requests:
$browser->env_proxy;

For details, see the proxy, env_proxy, and no_proxy methods in the LWP::UserAgent documentation.

HTTP Authentication

Many websites use HTTP authentication to restrict access. When a user requests a restricted page, the HTTP server replies, in effect, "That document is part of a protected 'realm', and you can access it only if you re-request it and add some special authorization headers to your request." For example, the administrators of unicode.org use HTTP authentication on the mailing-list archive to keep robots from harvesting e-mail addresses. The user name and password are public: the user name is "unicode-ml" and the password is "unicode".

Assume that the address of a restricted page is

http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html

If you request this address in a browser, a dialog pops up saying "Enter username and password for 'Unicode-MailList-Archives' at server 'www.unicode.org'", and you then enter the user name and password.

If you simply request this URL with LWP, without supplying credentials:

use LWP 5.64;
my $browser = LWP::UserAgent->new;

my $url =
'http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html';
my $response = $browser->get($url);

die "Error: ", $response->header('WWW-Authenticate') ||
'Error accessing',
# ('WWW-Authenticate' is the realm-name)
"\n ", $response->status_line, "\n at $url\n Aborting"
unless $response->is_success;

You will get the following error:

Error: Basic realm="Unicode-MailList-Archives"
401 Authorization Required
at http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html
Aborting at auth1.pl line 9. [or wherever]

This is because LWP does not know the user name and password for the "Unicode-MailList-Archives" realm on the host www.unicode.org. The simplest way to fix this is to supply the user name and password with the credentials method:

$browser->credentials(
'servername:portnumber',
'realm-name',
'username' => 'password'
);

The port is usually 80. You must call credentials before sending the request. For example:

$browser->credentials(
'reports.mybazouki.com:80',
'web_server_usage_reports',
'plinky' => 'banjo123'
);

For our unicode.org example, the call is:

$browser->credentials(  # add this to our $browser 's "key ring"
'www.unicode.org:80',
'Unicode-MailList-Archives',
'unicode-ml' => 'unicode'
);
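
One way to check that the credentials were registered under the right host, port, and realm, without sending a request, is to ask the user agent what it would offer for a given URL. This is only a sketch; it relies on the get_basic_credentials method of LWP::UserAgent, which LWP itself calls when a server answers with a 401 response.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use URI;

my $browser = LWP::UserAgent->new;

# Register the public unicode.org credentials from the article.
$browser->credentials(
    'www.unicode.org:80',
    'Unicode-MailList-Archives',
    'unicode-ml' => 'unicode',
);

# Ask which user name and password would be used for this realm and URL.
# The last argument says this is not a proxy authentication request.
my $url = URI->new(
    'http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html');
my ( $user, $pass ) =
    $browser->get_basic_credentials( 'Unicode-MailList-Archives', $url, 0 );

print "user=$user pass=$pass\n";
```

If the host, port, or realm name passed to credentials does not match exactly, the lookup returns nothing, which is a common reason for a request still failing with 401.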

Connect to HTTPS URLs

If your LWP installation has HTTPS support, you can access HTTPS URLs in the same way as HTTP ones:

use LWP 5.64;
my $url = 'https://www.paypal.com/'; # Yes, HTTPS!
my $browser = LWP::UserAgent->new;
my $response = $browser->get($url);
die "Error at $url\n ", $response->status_line, "\n Aborting"
unless $response->is_success;

print "Whee, it worked! I got that ",
$response->content_type, " document!\n";

If HTTPS is not supported, you get the following error message:

Error at https://www.paypal.com/
501 Protocol scheme 'https' is not supported
Aborting at paypal.pl line 7. [or whatever program and line]

If HTTPS support is installed, the request should succeed, and you can process the $response object just as you would for a normal HTTP request. For information on installing HTTPS support, see the README.SSL file in the libwww-perl distribution.

Get large files

When you request a large file, the usual request method ($response = $browser->get($url)) can cause memory problems, because $response holds the entire file in memory. Requesting a 30 MB file this way may be unwise. One solution is to save the file to disk instead:

$response = $ua->get($url,
':content_file' => $filespec,
);

For example:

$response = $ua->get('http://search.cpan.org/',
':content_file' => '/tmp/sco.html'
);

When you use :content_file, the headers are still in $response, but $response->content is empty. Note that versions of LWP earlier than 5.66 do not support :content_file, so you should write use LWP 5.66; to require it. If your program may run on an older version of LWP, you can use the following instead, which has the same effect as :content_file:

use HTTP::Request::Common;
$response = $ua->request( GET($url), $filespec );

Resources

The above is only an introduction to the common LWP functions. For more about LWP and related topics, see the following documents:

  • LWP::Simple: simple functions for getting, heading, and mirroring URLs.
  • LWP: an overview of the libwww-perl modules.
  • HTTP::Response: the class of the objects returned by LWP requests, as in $response = $browser->get(...).
  • HTTP::Message and HTTP::Headers: the classes from which HTTP::Response inherits many of its methods.
  • URI: the class for objects representing absolute and relative URLs.
  • URI::Escape: functions for correctly escaping and unescaping unusual characters in URLs (for example, converting between "this & that" and "this%20%26%20that").
  • HTML::Entities: functions for correctly encoding and decoding unusual characters in HTML (for example, converting between "C. & E. Brontë" and "C. &amp;amp; E. Bront&amp;euml;").
  • HTML::TokeParser and HTML::TreeBuilder: classes for parsing HTML.
  • HTML::LinkExtor: a class for finding the links in an HTML document.
  • And of course, my book, Perl & LWP.
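
The two escaping modules in the list above are easy to confuse, so here is a brief sketch of each, using the same sample strings as the list. The variable names are arbitrary.

```perl
use strict;
use warnings;
use URI::Escape    qw(uri_escape uri_unescape);
use HTML::Entities qw(encode_entities decode_entities);

# URL escaping: unsafe characters become %XX sequences.
my $escaped   = uri_escape('this & that');
my $unescaped = uri_unescape($escaped);

# HTML escaping: special characters become named entities.
# "\x{EB}" is the Latin-1 code point for "e" with a diaeresis.
my $name    = "C. & E. Bront\x{EB}";
my $encoded = encode_entities($name);
my $decoded = decode_entities($encoded);

print "$escaped\n";      # this%20%26%20that
print "$unescaped\n";    # this & that
print "$encoded\n";      # C. &amp; E. Bront&euml;
```

URI::Escape is for text going into a URL, while HTML::Entities is for text going into (or coming out of) an HTML document.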

Notes after discussion with Sean Burke

While translating this article, I contacted the author, Sean Burke, and he agreed to point out what should be added to or updated in the original. We also talked over MSN. Below are my own notes on LWP together with the additions from the author.

  • The article mentioned:
    use LWP::ConnCache;

    $browser->conn_cache(LWP::ConnCache->new());

    This tells the browser object to use the HTTP/1.1 "keep-alive" feature, which speeds up requests by reusing the same socket. You can also turn on keep-alive when creating the LWP::UserAgent object, as shown below:

    use LWP;
    $browser = LWP::UserAgent->new(keep_alive => 1);
  • Do not forget that the headers of the response object often hold a lot of useful information. You can get them with the headers_as_string and as_string methods. Here is an example using headers_as_string:
    use LWP;
    my $br = LWP::UserAgent->new;
    my $resp = $br->get('http://www.pulse24.com');
    print $resp->headers_as_string;

    Output result:

    Cache-Control: private, max-age=0
    Connection: close
    Date: Sun, 16 Jan 2005 04:18:26 GMT
    Server: Microsoft-IIS/6.0
    Content-Length: 432
    Content-Type: text/html
    Content-Type: text/html; charset=iso8859-1
    Client-Date: Sun, 16 Jan 2005 04:18:09 GMT
    Client-Peer: 207.61.136.40:80
    Client-Response-Num: 1
    REFRESH: 0;URL=http://www.pulse24.com/Front_Page/page.asp
    X-Meta-Robots: noindex
    X-Powered-By: ASP.NET

    You can also use $response->header('field') to fetch a specific header. In the example above, the page being fetched uses a meta refresh:

    <META HTTP-EQUIV="REFRESH" CONTENT="0;URL=http://www.pulse24.com/Front_Page/page.asp">

    You can use $response->header('Refresh') to get the refresh URL and decide whether to follow it.

  • Sometimes an address works in a browser but not with LWP. This is usually because your LWP request differs from what the server expects in its headers, Referer, cookies, or User-Agent. To track down the problem, compare the request your browser sends with the one LWP sends, adjust, and try again. This is often tedious work. I used to capture the traffic with Ethereal; these days I use the LiveHTTPHeaders extension for Firefox. LWP also comes with a debugging module, LWP::DebugFile, that can help you find the problem.
  • The article also mentions HTTP::Cookies::Netscape. Cookie-jar modules for LWP now exist for more browsers, such as Mozilla, Safari, and OmniWeb.
  • Forms are often used together with JavaScript, and LWP has no JavaScript engine. You therefore have to read the JavaScript in the page source yourself to work out what to do. For example:
    function Submit()
    {
    .........
    self.document.location.href="verify.php";

    return false;
    }

    ........

    <form>
    ......
    <input type=button value="Submit your page" onClick="javascript:Submit(); return false; //">

    In the example above, clicking the button triggers the JavaScript Submit function, which loads verify.php. With LWP you can skip the JavaScript entirely and request verify.php directly.

********************************************************************************

Translator: qiang
Source: Chinese Perl Association FPC (Foundation of Perlchina)
Author: Sean M. Burke, author of Perl & LWP (O'Reilly)
Original title: Web Basics with LWP
Published on December 1, February 28, 2002
Original article: http://www.perl.com/pub/a/2002/08/20/perlandlwp.html
Please respect the author's copyright and preserve the fruits of the author's work.
