http://hi.baidu.com/0soul/blog/item/91098701f5051880e850cd4b.html
Original article: http://www.perl.com/pub/a/2002/08/20/perlandlwp.html
The original carries a lot of copyright information, which I reproduce at the end.
I think this Perl LWP document is quite good, so I am reposting it here for easy reference.
The reprinted content follows:
********************************************************************************
Introduction to LWP and the Web
LWP (short for "Library for WWW in Perl") is a set of modules for fetching data from the Web. Like most Perl modules, each LWP module comes with its own detailed documentation, which serves as a complete reference for that module. But LWP contains so many modules that new users often do not know where to start, even for the simplest jobs. A full introduction to LWP would take a whole book; fortunately, Perl & LWP has been published. This article introduces the most common uses of LWP.
Using LWP::Simple to get a webpage
If all you want is to fetch a webpage, the functions in LWP::Simple are the easiest way. Call get($url) to fetch the content at that URL. If nothing goes wrong, get returns the content of the page; otherwise it returns undef. Example:
my $url = 'http://freshair.npr.org/dayFA.cfm?todayDate=current';
use LWP::Simple;
my $content = get $url;
die "Couldn't get $url" unless defined $content;
# $content now holds the webpage content; analyze it below:
if ($content =~ m/jazz/i) {
  print "They're talking about jazz today on Fresh Air!\n";
} else {
  print "Fresh Air is apparently jazzless today.\n";
}
If you want to fetch a page from the command line, the getprint function is handy. If nothing goes wrong, it prints the page content to STDOUT; otherwise it prints an error message to STDERR. For example:
% perl -MLWP::Simple -e "getprint 'http://cpan.org/RECENT'"
The URL above points to a plain-text file listing the files updated on CPAN in the last two weeks. Combined with the shell, it can email you the list of newly updated modules in the Acme:: namespace:
% perl -MLWP::Simple -e "getprint 'http://cpan.org/RECENT'" \
| grep "/by-module/Acme" | mail -s "New Acme modules! Joy!" $USER
LWP::Simple has several other useful functions, including one that runs a HEAD request, which is handy for checking whether a link is valid or whether a page has been updated, and two more for saving and mirroring a URL to a local file. For details, see the LWP::Simple documentation or chapter 2 of Perl & LWP.
Basics of the LWP class model
The functions in LWP::Simple are convenient for simple jobs, but they do not support cookies, user authentication, editing of HTTP request headers, or reading and writing of HTTP response headers (mainly HTTP error messages). When you need those features, you must use the LWP class model. Of the many LWP classes, the two you cannot do without are LWP::UserAgent and HTTP::Response. LWP::UserAgent acts like a virtual browser that you use to make requests. HTTP::Response holds the response that a request produces.
The most basic usage is $response = $browser->get($url), or written out more fully:
# Program start:
use LWP 5.64; # load the LWP classes and require a reasonably recent version
my $browser = LWP::UserAgent->new;
...
# Get request:
my $url = 'http://freshair.npr.org/dayFA.cfm?todayDate=current';
my $response = $browser->get($url);
die "Can't get $url -- ", $response->status_line
  unless $response->is_success;
die "Hey, I want HTML format instead of ", $response->content_type
  unless $response->content_type eq 'text/html';
  # Or any other Content-Type
# If it succeeds, process the content
if ($response->content =~ m/jazz/i) {
  print "Fresh Air is discussing jazz today!\n";
} else {
  print "The Fresh Air discussed today does not touch jazz at all.\n";
}
Two objects are involved here: $browser, which is an object of class LWP::UserAgent, and $response, which is an object of class HTTP::Response. A program needs only one $browser object, but every request you send gives you a new HTTP::Response object. An HTTP::Response object has these useful attributes:
- A status code indicating success or failure, which you can test with $response->is_success.
- An HTTP status line. Looking at $response->status_line (for example, "404 Not Found") tells you what the status code means.
- A MIME content type, obtained through $response->content_type, for example text/html, image/gif, or application/xml.
- The content of the response, stored in $response->content. For an HTML document this is the HTML source; for a GIF, $response->content is the binary GIF data.
- Many other methods, documented in HTTP::Response and its superclasses HTTP::Message and HTTP::Headers.
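To make the attributes above concrete, here is a minimal offline sketch that builds an HTTP::Response by hand instead of fetching one, then reads the same four properties. The status, header, and body values are made up for illustration; in a real program the object would come from $browser->get($url).

```perl
use strict;
use warnings;
use HTTP::Response;

# Hand-built response (illustrative values, not from a real server)
my $response = HTTP::Response->new(
  404, 'Not Found',
  [ 'Content-Type' => 'text/html; charset=iso-8859-1' ],
  '<html><body>No such page</body></html>',
);

my $ok     = $response->is_success ? 'yes' : 'no';
my $status = $response->status_line;    # status code plus message
my $type   = $response->content_type;   # media type without parameters
my $body   = $response->content;

print "Success? $ok\n";
print "Status line: $status\n";
print "Content type: $type\n";
print "Body length: ", length($body), " bytes\n";
```

Because the status code is 404, is_success is false here, which is exactly the case the die ... unless $response->is_success idiom guards against.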
Add other HTTP request headers
The usual form of a request is $response = $browser->get($url). You can add other HTTP headers to the request by following $url with a list of key/value pairs, like this:
$response = $browser->get( $url, $key1, $value1, $key2, $value2, ... );
For example, to send a request to a website that only accepts connections from Netscape browsers, you need to send Netscape-like headers, as follows:
my @ns_headers = (
  'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)',
  'Accept' => 'image/gif, image/x-xbitmap, image/jpeg, '
            . 'image/pjpeg, image/png, */*',
  'Accept-Charset' => 'iso-8859-1,*,utf-8',
  'Accept-Language' => 'en-US',
);
...
$response = $browser->get($url, @ns_headers);
If you are not going to reuse the array, you can write the headers directly in the call to get:
$response = $browser->get($url,
  'User-Agent' => 'Mozilla/4.76 [en] (Win98; U)',
  'Accept' => 'image/gif, image/x-xbitmap, image/jpeg, '
            . 'image/pjpeg, image/png, */*',
  'Accept-Charset' => 'iso-8859-1,*,utf-8',
  'Accept-Language' => 'en-US',
);
If you only want to change the User-Agent header, you can use the agent method of LWP::UserAgent to replace the default agent string ('libwww-perl/' followed by the version number) with something else:
$browser->agent('Mozilla/4.76 [en] (Win98; U)');
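To check which headers a request will carry without contacting any server, one option is to build the request object yourself and print it. The sketch below uses GET from HTTP::Request::Common, which only constructs the request; the URL is a placeholder and the header values are reused from the examples above.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common qw(GET);

my $browser = LWP::UserAgent->new;
$browser->agent('Mozilla/4.76 [en] (Win98; U)');

# GET() accepts header pairs after the URL, like $browser->get does,
# but it only builds the request; nothing is sent here.
my $request = GET(
  'http://www.example.com/',          # placeholder URL
  'Accept-Language' => 'en-US',
  'Accept-Charset'  => 'iso-8859-1,*,utf-8',
);

my $agent   = $browser->agent;        # called with no argument, returns the current value
my $as_text = $request->as_string;    # request line plus headers

print "Agent string: $agent\n";
print $as_text;
```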
Use cookies
By default, an LWP::UserAgent object behaves like a browser that does not support cookies. There is more than one way to enable cookies by setting its cookie_jar attribute. A "cookie jar" is a container for HTTP cookies. You can keep the jar on disk (as Netscape does with its cookies.txt file) or in memory. Cookies kept in memory disappear when the program exits. To use an in-memory cookie jar:
$browser->cookie_jar({});
You can also store cookies in files on the hard disk:
use HTTP::Cookies;
$browser->cookie_jar( HTTP::Cookies->new(
  'file' => '/some/where/cookies.lwp',
      # where the cookies are stored
  'autosave' => 1,
      # save to disk automatically when done
));
The cookies in that file are stored in LWP's own format. If you want to use Netscape's cookie file instead, use the HTTP::Cookies::Netscape class:
use HTTP::Cookies;
# yes, loads HTTP::Cookies::Netscape too
$browser->cookie_jar( HTTP::Cookies::Netscape->new(
'file' => 'c:/Program Files/Netscape/Users/DIR-NAME-HERE/cookies.txt',
# where to read cookies
));
You can also use 'autosave' => 1 as above. However, at the time this article was written, it was uncertain whether Netscape might discard some of the cookies written back to disk this way.
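To see what a cookie jar does with a request, the following offline sketch stores one cookie in an in-memory jar by hand and asks the jar to decorate an outgoing request. The cookie name, value, and host are invented for illustration; in normal use the jar fills itself from the Set-Cookie headers of the responses that $browser receives.

```perl
use strict;
use warnings;
use HTTP::Cookies;
use HTTP::Request;

my $jar = HTTP::Cookies->new;   # no 'file' given, so memory only

# set_cookie(version, key, value, path, domain, port, path_spec,
#            secure, maxage, discard)
$jar->set_cookie(0, 'session', 'abc123', '/', 'www.example.com',
                 undef, 0, 0, 3600, 0);

# Ask the jar to add any matching cookies to an outgoing request
my $request = HTTP::Request->new(GET => 'http://www.example.com/index.html');
$jar->add_cookie_header($request);

my $cookie_header = $request->header('Cookie');
print "Cookie header: ",
  (defined $cookie_header ? $cookie_header : '(none)'), "\n";
```

A user agent whose cookie_jar is set performs the add_cookie_header step for you on every request.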
Submit a form through POST
Most HTML forms submit their data to the server with an HTTP POST request. You can send one like this:
$response = $browser->post( $url,
[
formkey1 => value1,
formkey2 => value2,
...
],
);
Or, if you also need to send HTTP headers, add them after the form data:
$response = $browser->post( $url,
[
formkey1 => value1,
formkey2 => value2,
...
],
headerkey1 => value1,
headerkey2 => value2,
);
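To see how the form pairs end up in the request body, you can build the same kind of request with POST from HTTP::Request::Common, which constructs the request without sending it. This is only a sketch: the URL and the form values are placeholders.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# POST() only builds the request object; nothing is sent.
my $request = POST(
  'http://www.example.com/search',   # placeholder URL
  [ q => 'tarragon', pg => 'q' ],    # form data pairs
  'Accept-Language' => 'en-US',      # an extra request header
);

my $method = $request->method;
my $type   = $request->content_type;
my $body   = $request->content;      # the URL-encoded form data

print "$method with $type\n";
print "$body\n";
```

The array reference keeps the form fields in the order you wrote them, which is why the examples here pass [ ... ] for form data.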
The next example sends an HTTP POST request to the AltaVista search engine and then extracts the total number of matches from the returned HTML.
use strict;
use warnings;
use LWP 5.64;
my $browser = LWP::UserAgent->new;
my $word = 'tarragon';
my $url = 'http://www.altavista.com/sites/search/web';
my $response = $browser->post( $url,
  [ 'q' => $word, # The AltaVista query string
    'pg' => 'q', 'avkw' => 'tgz', 'kl' => 'XX',
  ]
);
die "$url error: ", $response->status_line
  unless $response->is_success;
die "Weird content type at $url -- ", $response->content_type
  unless $response->content_type eq 'text/html';
if ($response->content =~ m{AltaVista found ([0-9,]+) results}) {
  # The matched text looks like "AltaVista found 2,345 results"
  print "$word: $1\n";
} else {
  print "Couldn't find the match-string in the response\n";
}
Submit a form through GET
Some HTML forms send their data with a GET request instead of a POST request. For example, if you search imdb.com for the movie title 'Blade Runner', the URL shown in the browser's address bar after you submit the form is:
http://us.imdb.com/Tsearch?title=Blade%20Runner&restrict=Movies+and+TV
Here is how to make the same request with LWP:
use URI;
my $url = URI->new( 'http://us.imdb.com/Tsearch' );
# makes an object representing the URL
$url->query_form( # And here the form data pairs:
'title' => 'Blade Runner',
'restrict' => 'Movies and TV',
);
my $response = $browser->get($url);
See chapter 5 of Perl & LWP for a detailed discussion of HTML forms and form data, and chapters 6 through 9 for how to extract useful information from the HTML you fetch.
URL Processing
The URI class mentioned above provides many methods for reading and modifying URLs. For example, to find out a URL's scheme (http, ftp, and so on), use $url->scheme. To extract the host name, use $url->host. The most useful methods, though, are probably query_form, shown earlier, and new_abs, which turns a relative URL path (such as "../foo.html") into an absolute one (such as "http://www.perl.com/stuff/foo.html"). Example:
use URI;
$abs = URI->new_abs($maybe_relative, $base);
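The URI methods named above can be tried without any network access. In this sketch the base URL (the page a relative link was supposedly found on) is an assumed value chosen so that "../foo.html" resolves to the address used in the text.

```perl
use strict;
use warnings;
use URI;

my $url    = URI->new('http://www.perl.com/stuff/bar/index.html');  # assumed base page
my $scheme = $url->scheme;
my $host   = $url->host;

# Resolve a relative link against the page it was found on
my $abs = URI->new_abs('../foo.html', $url);

# Build a query string from key/value pairs
my $search = URI->new('http://us.imdb.com/Tsearch');
$search->query_form(title => 'Blade Runner', restrict => 'Movies and TV');

print "scheme: $scheme\n";
print "host:   $host\n";
print "abs:    $abs\n";
print "search: $search\n";
```

Note that query_form writes spaces as "+", while the browser URL shown earlier uses "%20" for the title; both are accepted encodings of a space in a query string.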
Now consider this program, which lists the links in the HTML page of recent CPAN uploads:
use strict;
use warnings;
use LWP 5.64;
my $browser = LWP::UserAgent->new;
my $url = 'http://www.cpan.org/RECENT.html';
my $response = $browser->get($url);
die "Can't get $url -- ", $response->status_line
unless $response->is_success;
my $html = $response->content;
while( $html =~ m/<A href=\"(.*?)\"/g ) {
  print "$1\n";
}
The output result is
MIRRORING.FROM
RECENT
RECENT.html
authors/00whois.html
authors/01mailrc.txt.gz
authors/id/A/AA/AASSAD/CHECKSUMS
...
To get full URLs, use the new_abs method of the URI module and change the while loop to this:
while( $html =~ m/<A href=\"(.*?)\"/g ) {
  print URI->new_abs( $1, $response->base ), "\n";
}
The $response->base method is documented in HTTP::Message. It returns the URL that relative paths in the response should be resolved against to get full URLs. The result is:
http://www.cpan.org/MIRRORING.FROM
http://www.cpan.org/RECENT
http://www.cpan.org/RECENT.html
http://www.cpan.org/authors/00whois.html
http://www.cpan.org/authors/01mailrc.txt.gz
http://www.cpan.org/authors/id/A/AA/AASSAD/CHECKSUMS
...
For more details about URI objects, see chapter 4 of Perl & LWP. Matching URLs with a regular expression, as above, is fine for simple cases. For more complex cases that need more powerful tools, consider the HTML parsing modules HTML::LinkExtor or HTML::TokeParser, or even HTML::TreeBuilder.
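As a rough illustration of the parser-based alternative, here is a sketch that uses HTML::LinkExtor on a small HTML string. The HTML is made up and stands in for $response->content, and the base URL stands in for $response->base.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# A small stand-in for a fetched page (made-up content)
my $html = <<'END_HTML';
<html><body>
<A HREF="RECENT">recent</A>
<a href="authors/00whois.html">authors</a>
<img src="/images/logo.png">
</body></html>
END_HTML

my $base = 'http://www.cpan.org/RECENT.html';

# With no callback, links are collected and returned by ->links;
# the second argument makes every URL absolute against $base.
my $parser = HTML::LinkExtor->new(undef, $base);
$parser->parse($html);
$parser->eof;

my @hrefs;
for my $link ($parser->links) {
  my ($tag, %attr) = @$link;       # tag and attribute names arrive lowercased
  next unless $tag eq 'a' && defined $attr{href};
  push @hrefs, "$attr{href}";      # stringify the URI object
}
print "$_\n" for @hrefs;
```

Unlike the regular expression, this does not care whether the tag is written as <A HREF=...> or <a href=...>, and it skips the image link because only "a" tags are kept.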
Other browser attributes
LWP::UserAgent objects have several other noteworthy attributes.
For details, see the LWP::UserAgent documentation.
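As a sketch of the kind of attributes meant here, the following sets a few that LWP::UserAgent documents: timeout, max_size, protocols_allowed, and agent. The selection and the values are my own choice for illustration, not a list taken from the article.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;

$browser->timeout(15);                            # seconds of inactivity before giving up
$browser->max_size(1_000_000);                    # cap response bodies at about 1 MB
$browser->protocols_allowed(['http', 'https']);   # refuse other URL schemes
$browser->agent('ExampleBrowser/0.1');            # hypothetical agent name

# Each accessor returns the current value when called without arguments
my $timeout   = $browser->timeout;
my $max_size  = $browser->max_size;
my @protocols = @{ $browser->protocols_allowed };

print "timeout=$timeout max_size=$max_size protocols=@protocols\n";
```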
Write a polite robot
If you want your program to obey robots.txt and avoid sending too many requests in a short time, use LWP::RobotUA instead of LWP::UserAgent. LWP::RobotUA is used the same way as LWP::UserAgent:
use LWP::RobotUA;
my $browser = LWP::RobotUA->new(
  'Yoursuperbot/100', 'you@yoursite.com');
  # Robot name and email address
my $response = $browser->get($url);
LWP::RobotUA has the following features:
- If the robots.txt on the server that $url points to forbids you from accessing $url, $browser does not send a request for it. Instead, it returns a response with code 403 and the message "Forbidden by robots.txt". So if you have this line:
die "$url -- ", $response->status_line, "\nAborted"
unless $response->is_success;
Then you will get the following error message:
http://whatever.site.int/pith/x.html -- 403 Forbidden by robots.txt
Aborted at whateverprogram.pl line 1234
- If $browser sees that you have requested from the same server very recently, it pauses (sleeps) to avoid sending too many requests in a short time. The default delay is 1 minute, but you can change it with $browser->delay(minutes). For example:
$browser->delay( 7/60 );
For details, see the LWP::RobotUA documentation.
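The two behaviors described above can be explored offline. The sketch below sets the delay on an LWP::RobotUA object and, separately, uses WWW::RobotRules (the module that handles robots.txt parsing for LWP::RobotUA) to check two URLs against a robots.txt. The robots.txt content and the site are invented for illustration.

```perl
use strict;
use warnings;
use LWP::RobotUA;
use WWW::RobotRules;

my $browser = LWP::RobotUA->new('Yoursuperbot/100', 'you@yoursite.com');
$browser->delay(7/60);   # delay is given in minutes, so this is 7 seconds

# Made-up robots.txt content for a made-up site
my $robots_txt = <<'END_ROBOTS';
User-agent: *
Disallow: /pith/
END_ROBOTS

my $rules = WWW::RobotRules->new('Yoursuperbot/100');
$rules->parse('http://whatever.site.int/robots.txt', $robots_txt);

my $pith_ok  = $rules->allowed('http://whatever.site.int/pith/x.html');
my $index_ok = $rules->allowed('http://whatever.site.int/index.html');

print "pith/x.html allowed? ", ($pith_ok  ? 'yes' : 'no'), "\n";
print "index.html allowed?  ", ($index_ok ? 'yes' : 'no'), "\n";
```

With LWP::RobotUA you never call WWW::RobotRules yourself; it fetches and checks robots.txt for you before each request.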
Use a proxy
Sometimes you want to (or must) connect to certain sites or protocols through a proxy, for example when your LWP program runs on a machine behind a firewall. The proxy address is usually stored in an environment variable such as HTTP_PROXY. LWP can load proxy settings from the environment through the user-agent object's env_proxy method:
use LWP::UserAgent;
my $browser = LWP::UserAgent->new;
# And before you go making any requests:
$browser->env_proxy;
For details, see the proxy, env_proxy, and no_proxy methods in the LWP::UserAgent documentation.
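To confirm what env_proxy picked up, you can read the setting back with the proxy method. In this sketch the environment variable is set inside the program only so that the example is self-contained; the proxy address and the no_proxy host names are placeholders.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# For the demo, set the variable ourselves (placeholder proxy address);
# normally it is already present in the environment.
delete $ENV{HTTP_PROXY};
$ENV{http_proxy} = 'http://proxy.example.com:8080/';

my $browser = LWP::UserAgent->new;
$browser->env_proxy;

# Called with only a scheme, proxy() returns the current setting
my $http_proxy = $browser->proxy('http');
print "HTTP requests go through: $http_proxy\n";

# Hosts that should be reached directly, bypassing the proxy
$browser->no_proxy('localhost', 'intranet.example.com');
```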
HTTP Authentication
Many websites use HTTP authentication to restrict access. When a user requests a restricted page, the HTTP server replies, in effect, "That document is part of a protected 'realm', and you can access it only if you re-request it and add some special authorization headers to your request". For example, the unicode.org administrators use HTTP authentication to keep robots from harvesting email addresses from the mailing-list archives. The user name and password are public: user name "unicode-ml", password "unicode".
Assume that the address of a restricted page is
http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html
If you request this address in a browser, a dialog pops up asking you to "Enter username and password for 'Unicode-MailList-Archives' at server 'www.unicode.org'", and you enter the user name and password there, as shown in the following figure:
If you simply request this URL with LWP:
use LWP 5.64;
my $browser = LWP::UserAgent->new;
my $url =
'http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html';
my $response = $browser->get($url);
die "Error: ", $response->header('WWW-Authenticate') ||
'Error accessing',
# ('WWW-Authenticate' is the realm-name)
"\n ", $response->status_line, "\n at $url\n Aborting"
unless $response->is_success;
You will get the following error:
Error: Basic realm="Unicode-MailList-Archives"
401 Authorization Required
at http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html
Aborting at auth1.pl line 9. [or wherever]
This is because LWP does not know the user name and password for the "Unicode-MailList-Archives" realm on host www.unicode.org. The simplest way to solve this is to supply them with the credentials method:
$browser->credentials(
'servername:portnumber',
'realm-name',
'username' => 'password'
);
The port is usually 80, the default HTTP port. The credentials method must be called before you send the request. For example:
$browser->credentials(
'reports.mybazouki.com:80',
'web_server_usage_reports',
'plinky' => 'banjo123'
);
Our unicode.org example can be written as:
$browser->credentials( # add this to our $browser 's "key ring"
'www.unicode.org:80',
'Unicode-MailList-Archives',
'unicode-ml' => 'unicode'
);
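To check that the "key ring" entry matches the server, port, and realm you expect, you can ask the browser object which credentials it would use for a given URL, without sending anything. This sketch calls get_basic_credentials, the LWP::UserAgent method that LWP consults when a server demands authentication.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use URI;

my $browser = LWP::UserAgent->new;
$browser->credentials(
  'www.unicode.org:80',
  'Unicode-MailList-Archives',
  'unicode-ml' => 'unicode'
);

# Look up what would be sent for this realm and URL
# (the third argument says this is not a proxy login).
my $url = URI->new(
  'http://www.unicode.org/mail-arch/unicode-ml/y2002-m08/0067.html');
my ($user, $password) =
  $browser->get_basic_credentials('Unicode-MailList-Archives', $url, 0);

print "Would log in as ", (defined $user ? $user : '(nobody)'), "\n";
```

If the lookup comes back empty, the usual culprits are a missing port number in the first argument or a realm name that does not match the one in the WWW-Authenticate header exactly.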
Connect to HTTPS URLs
If your LWP installation has HTTPS support, you can access HTTPS URLs in the same way as HTTP URLs:
use LWP 5.64;
my $url = 'https://www.paypal.com/'; # Yes, HTTPS!
my $browser = LWP::UserAgent->new;
my $response = $browser->get($url);
die "Error at $url\n ", $response->status_line, "\n Aborting"
unless $response->is_success;
print "Whee, it worked! I got that ",
$response->content_type, " document!\n";
If your LWP installation does not have HTTPS support, you get the following error message:
Error at https://www.paypal.com/
501 Protocol scheme 'https' is not supported
Aborting at paypal.pl line 7. [or whatever program and line]
If your LWP installation does have HTTPS support, the request should succeed, and you can process the $response object just as you would for a normal HTTP request. For information on installing HTTPS support, see the README.SSL file in the libwww-perl distribution.
Get large files
When you request a large file, the usual request method ($response = $browser->get($url)) may cause memory problems, because $response holds the entire file. Requesting a 30 MB file that way may not be wise. One solution is to save the file to disk instead:
$response = $ua->get($url,
':content_file' => $filespec,
);
For example:
$response = $ua->get('http://search.cpan.org/',
':content_file' => '/tmp/sco.html'
);
When :content_file is used, the headers are still in $response, but $response->content is empty. Note that versions of LWP earlier than 5.66 do not support :content_file, so you should write use LWP 5.66; to require at least that version. If your program may run on an older version of LWP, you can use the following form instead, which has the same effect as :content_file:
use HTTP::Request::Common;
$response = $ua->request( GET($url), $filespec );
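The effect of :content_file can be observed without a network by fetching a file: URL, which LWP also supports. This is only a sketch under that assumption: the temporary files are created just for the demonstration, and a real program would pass an http URL and a destination path of its own.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use File::Temp qw(tempdir);
use File::Spec;
use URI::file;

my $dir    = tempdir(CLEANUP => 1);
my $source = File::Spec->catfile($dir, 'source.txt');
my $saved  = File::Spec->catfile($dir, 'saved.txt');

# Create a stand-in "remote" file so the sketch runs without a network
open my $fh, '>', $source or die "Can't write $source: $!";
print {$fh} "line $_\n" for 1 .. 1000;
close $fh or die "Can't close $source: $!";

my $ua  = LWP::UserAgent->new;
my $url = URI::file->new($source);   # file: URL for the local file

my $response = $ua->get($url, ':content_file' => $saved);
die "Can't get $url -- ", $response->status_line
  unless $response->is_success;

my $in_memory = length $response->content;   # body went to disk instead
my $on_disk   = -s $saved;
print "In memory: $in_memory bytes, on disk: $on_disk bytes\n";
```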
Resources
The above is only an introduction to the most common LWP features. For more information about LWP and related modules, see the following documentation.
- LWP::Simple: simple functions for getting, heading, and mirroring URLs.
- LWP: an overview of the libwww-perl modules.
- HTTP::Response: the class for the response you get back from an LWP request, as in $response = $browser->get(...).
- HTTP::Message and HTTP::Headers: many of HTTP::Response's methods are inherited from these two classes.
- URI: a class for working with absolute and relative URLs.
- URI::Escape: functions for correctly escaping and unescaping unusual characters in URLs (for example, converting between "this & that" and "this%20%26%20that").
- HTML::Entities: functions for correctly encoding and decoding unusual characters in HTML (for example, converting between "C. & E. Brontë" and "C. &amp; E. Bront&euml;").
- HTML::TokeParser and HTML::TreeBuilder: classes for parsing HTML.
- HTML::LinkExtor: a class for finding the links in HTML.
- And, of course, my book Perl & LWP.
Notes after discussion with Sean Burke
While translating this article, I contacted the author, Sean Burke, who agreed to point out what should be supplemented or updated in the original. I also talked with him over MSN. Here I write down my own understanding of LWP and my additions to the author's article.
- The article mentions:
use LWP::ConnCache;
$browser->conn_cache(LWP::ConnCache->new());
This tells the browser object to use the HTTP/1.1 "Keep-Alive" feature, that is, to reuse the previous socket to speed up requests. You can also turn on Keep-Alive when you create the LWP::UserAgent object, as shown below:
use LWP;
my $browser = LWP::UserAgent->new(keep_alive => 1);
- Do not forget that the headers of the response object often carry a lot of useful information. You can get them with the headers_as_string and as_string methods. Here is an example using headers_as_string:
use LWP;
my $br = LWP::UserAgent->new;
my $resp = $br ->get('http://www.pulse24.com');
print $resp->headers_as_string;
Output result:
Cache-Control: private, max-age=0
Connection: close
Date: Sun, 16 Jan 2005 04:18:26 GMT
Server: Microsoft-IIS/6.0
Content-Length: 432
Content-Type: text/html
Content-Type: text/html; charset=iso8859-1
Client-Date: Sun, 16 Jan 2005 04:18:09 GMT
Client-Peer: 207.61.136.40:80
Client-Response-Num: 1
REFRESH: 0;URL=http://www.pulse24.com/Front_Page/page.asp
X-Meta-Robots: noindex
X-Powered-By: ASP.NET
You can also use $response->header('field') to get one particular header. In the example above, the page being fetched uses a meta refresh:
<META HTTP-EQUIV="REFRESH" CONTENT="0;URL=http://www.pulse24.com/Front_Page/page.asp">
You can use $response->header('refresh') to get the refresh URL and decide whether to follow it.
- Sometimes a browser can access an address normally but LWP cannot. This is usually because the headers, referer, cookies, or user-agent that your LWP program sends differ from what the web server expects. To locate the problem, compare the request your browser sends with the request your LWP program sends, adjust, and try again. This is often a complicated job. I used to capture the traffic with Ethereal; now I use the LiveHTTPHeaders extension for Firefox. LWP now also comes with a debugging module, LWP::DebugFile, to help you find the problem.
- The article also mentions HTTP::Cookies::Netscape. Cookie modules for LWP are now available for more browsers, such as Mozilla, Safari, and OmniWeb.
- Forms and JavaScript are often used together, and LWP has no JavaScript engine. You therefore have to read the JavaScript in the page source yourself to work out what to do. For example:
function Submit()
{
.........
self.document.location.href="verify.php";
return false;
}
........
<form>
......
<input type=button value="Submit your page" onClick="javascript:Submit(); return false;">
In the example above, submitting the form triggers the JavaScript Submit function, which loads verify.php. So you can skip the JavaScript entirely and request verify.php directly.
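Two of the notes above, the Keep-Alive setting and the Refresh header, can be tried offline. The sketch below creates user agents with a connection cache in both ways mentioned, then pulls the delay and target URL out of a Refresh header on a hand-built response. The regular expression is my own rough pattern for the "seconds;URL=..." form shown above, not something LWP provides.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use LWP::ConnCache;
use HTTP::Response;

# Keep-Alive, option 1: let the constructor create the cache
my $browser_a = LWP::UserAgent->new(keep_alive => 1);

# Keep-Alive, option 2: attach a cache explicitly
my $browser_b = LWP::UserAgent->new;
$browser_b->conn_cache(LWP::ConnCache->new);

my @cache_classes = map { ref $_->conn_cache } ($browser_a, $browser_b);
print "Connection caches: @cache_classes\n";

# Hand-built response carrying the Refresh value from the example above
my $response = HTTP::Response->new(
  200, 'OK',
  [ 'Content-Type' => 'text/html',
    'Refresh'      => '0;URL=http://www.pulse24.com/Front_Page/page.asp' ],
  '',
);

my $refresh = $response->header('Refresh');   # header names are case-insensitive
my ($delay, $target);
if (defined $refresh && $refresh =~ m/^\s*(\d+)\s*;\s*url\s*=\s*(\S+)/i) {
  ($delay, $target) = ($1, $2);
  print "Refresh after $delay seconds to $target\n";
} else {
  print "No refresh target found\n";
}
```

Whether to follow the target is up to your program; one option is to pass $target to $browser->get and repeat the check on the new response.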
********************************************************************************
Translator/Author: qiang
Source: Chinese Perl Association FPC (Foundation of Perlchina)
Author: Sean M. Burke, author of Perl & LWP (O'Reilly)
Original Name: Web Basics with LWP
Published on December 1, February 28, 2002
Original article: http://www.perl.com/pub/a/2002/08/20/perlandlwp.html
Please respect the author's copyright and the work that went into this article.