Logging in to websites with Perl: principle and implementation
Tkorays
Children only practice the sermon, and adults speak only.
Writing a script to crawl a page is usually easy, but not always. Some pages can only be viewed after logging in, for example when you want to crawl your own grades from a course system. So here is how logging in to a website with Perl works, and how to implement it.
Principle
If you know a little about how HTTP works, this is easy to understand. Opening a web page in a browser amounts to this: you send a request, and the server responds with the content of the page you asked for. Of course, requests and responses both follow a fixed format.
First, the request the browser sends.
Requests mostly come in two kinds, POST and GET. Skipping the details, the big difference is that parameters passed by POST do not appear in the URL, while parameters passed by GET do. Logging in is a form submission, and forms are usually submitted with POST or GET. For example, here I search for Apache on Open Source China:
This form uses GET, so the parameters appear in the URL.
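The difference between the two methods can be sketched in a few lines of Perl. The host www.example.com and the parameter name q are placeholders chosen for illustration, not the real search form; nothing is sent over the network here, the requests are only built and printed.

```perl
use strict;
use warnings;
use URI;
use HTTP::Request::Common qw(GET POST);

# Hypothetical search endpoint and parameter name, for illustration only.
my $uri = URI->new('http://www.example.com/search');
$uri->query_form(q => 'Apache');

# GET: the parameters are part of the URL itself.
my $get = GET($uri);
print "GET URL:   ", $get->uri, "\n";

# POST: the URL stays clean and the parameters travel in the request body.
my $post = POST('http://www.example.com/search', [ q => 'Apache' ]);
print "POST URL:  ", $post->uri, "\n";
print "POST body: ", $post->content, "\n";
```

The same key-value pair ends up in the query string for GET and in the body for POST, which is why only the GET version shows up in the address bar.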
So is that all the data the browser sends to the server? The answer is of course no. Open the details of the request and you will see the request headers:
All of this data really is sent, so if you implement this yourself in C++, don't forget to send the necessary headers. (Hint: each header line is terminated by \r\n.) If you use Perl, there are ready-made libraries and you don't need to worry about most of the details; you just need to know that a cookie may be required!
So for the request, you only need to focus on the request parameters and any cookies that may be required.
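To get a feel for what such a request looks like as text, here is a small sketch that builds one by hand and prints it with \r\n line endings. The URL and the cookie value are made up. The output is only an approximation of what goes over the wire (a real client sends a path and a Host header in place of the full URL), and in practice LWP assembles and sends all of this for you.

```perl
use strict;
use warnings;
use HTTP::Request;

# Hypothetical URL and cookie value, for illustration only.
my $req = HTTP::Request->new(GET => 'http://www.example.com/home');
$req->header('User-Agent' => 'Mozilla/5.0 (Windows NT 6.1; rv:30.0) Gecko/20100101 Firefox/30.0');
$req->header('Cookie'     => 'session=abc123');

# as_string("\r\n") renders the request line and headers with CRLF line endings.
my $raw = $req->as_string("\r\n");
print $raw;
```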
Next, the response.
Similarly, what the server returns is more than the HTML you see in the page source; it also includes response headers.
Response headers are similar to request headers; they mainly tell the browser how to handle the response. Note the Content-Type header: the text/html that follows it indicates that the returned data is HTML. (When JavaScript is returned, Content-Type is application/x-javascript.) Of course there are more response headers than this one, and most of the time you don't need to pay attention to them. I mention them only for completeness and you can ignore them entirely. For the page itself, you just need what comes after the headers (you already know whether HTML or JavaScript is coming back, so why bother checking).
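The split between response headers and page content can be illustrated offline by parsing a hand-written response. The raw text below is invented for the example; a real response would come back from $ua->get or $ua->post as an HTTP::Response object with the same accessors.

```perl
use strict;
use warnings;
use HTTP::Response;

# A made-up raw response, standing in for what a server might send back.
my $raw = join "\r\n",
    'HTTP/1.1 200 OK',
    'Content-Type: text/html; charset=utf-8',
    'Set-Cookie: session=abc123; path=/',
    '',
    '<html><body>hello</body></html>';

my $res = HTTP::Response->parse($raw);

print "status: ", $res->code, "\n";
print "type:   ", $res->content_type, "\n";   # media type only, without parameters
print "body:   ", $res->content, "\n";        # everything after the headers
```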
Cookies are small pieces of data stored on the browser side that hold some information; they are really just key-value pairs. Often the server generates some data and hands it to the browser, and that data matters in the communication that follows. Fortunately, Perl has a ready-made library that manages cookies for us; doing it yourself would be a lot of work.
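As a rough sketch of what that library does, the snippet below stores one cookie in an in-memory HTTP::Cookies jar and lets the jar attach it to a later request. The cookie name, value and host are assumptions for illustration; in real use the jar is filled automatically from the server's Set-Cookie headers once it is attached to the UserAgent.

```perl
use strict;
use warnings;
use HTTP::Cookies;
use HTTP::Request;

# An in-memory jar; nothing is written to disk in this sketch.
my $jar = HTTP::Cookies->new;

# Hypothetical cookie, as if the server had sent "Set-Cookie: session=abc123".
# Arguments: version, key, value, path, domain, port, path_spec, secure, maxage, discard
$jar->set_cookie(0, 'session', 'abc123', '/', 'www.example.com', undef, 1, 0, 3600, 0);

# On a later request to the same host, the jar supplies the Cookie header.
my $req = HTTP::Request->new(GET => 'http://www.example.com/home');
$jar->add_cookie_header($req);
print "Cookie: ", $req->header('Cookie'), "\n";
```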
On a related note, URL encoding issues sometimes come up.
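A minimal sketch of URL encoding with URI::Escape follows; the sample strings are arbitrary. When form parameters are passed to the UserAgent's post method as a hash, LWP encodes them for you, so manual escaping is mostly needed when you assemble a URL string yourself.

```perl
use strict;
use warnings;
use URI::Escape qw(uri_escape uri_escape_utf8 uri_unescape);

# Characters such as space, '&' and '=' must be percent-encoded inside a value.
my $escaped = uri_escape('a b&c=d');
print "$escaped\n";
print uri_unescape($escaped), "\n";

# For text outside ASCII, the characters are encoded as UTF-8 first;
# uri_escape_utf8 does both steps.
my $escaped_utf8 = uri_escape_utf8("\x{4eba}\x{4eba}");
print "$escaped_utf8\n";
```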
Implementation
Having covered the principle, I of course have to cover the implementation, or I would be cheating the reader.
The main package used in Perl is LWP, together with its UserAgent, Cookies, Response and other classes. If you don't know these classes, you can look them up on CPAN.
Here we need to simulate the behavior of a browser, so we create a UserAgent object.
    my $ua = LWP::UserAgent->new;
    $ua->agent("Mozilla/5.0 (Windows NT 6.1; rv:30.0) Gecko/20100101 Firefox/30.0");
This way the User-Agent request header is the same as Firefox's, and the server will think we are using Firefox.
Don't forget the cookies.
    my $cookie_jar = HTTP::Cookies->new(file => 'lwp_cookies.txt', autosave => 1, ignore_discard => 1);
    $ua->cookie_jar($cookie_jar);
Then call the UserAgent's post and get methods to do whatever you want.
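One detail is worth knowing when post is used for a login form: many sites answer a successful login with a redirect, and by default LWP::UserAgent only follows redirects automatically for GET and HEAD requests, so the response to a POST is the redirect itself and its Location header can be inspected. The helper login_succeeded below is a hypothetical name, and the responses are simulated with placeholder URLs so the sketch runs without network access.

```perl
use strict;
use warnings;
use HTTP::Response;

# Returns true when the response is a redirect whose Location matches $expected.
# A missing Location header is treated as failure instead of triggering a warning.
sub login_succeeded {
    my ($res, $expected) = @_;
    return 0 unless $res->is_redirect;
    my $location = $res->header('Location') // '';
    return $location eq $expected ? 1 : 0;
}

# Simulated responses (no network access); the URLs are placeholders.
my $ok  = HTTP::Response->new(302, 'Found', [ Location => 'http://www.example.com/home' ]);
my $bad = HTTP::Response->new(200, 'OK');

print login_succeeded($ok,  'http://www.example.com/home') ? "login ok\n" : "login failed\n";
print login_succeeded($bad, 'http://www.example.com/home') ? "login ok\n" : "login failed\n";
```

Which URL a given site redirects to after login is something you have to observe in the browser first; the comparison target here is only an example.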
If this is still not clear, here is an example that logs in to Renren, which should make it easy to understand. It is commented:
    #!/usr/bin/perl
    # Copyright 2014 tkorays. All rights reserved.
    # Author: tkorays
    use strict;
    use warnings;
    use LWP;
    use LWP::Simple;
    use LWP::UserAgent;
    use HTTP::Cookies;
    use HTTP::Headers;
    use HTTP::Response;
    use Encode;
    use URI::Escape;
    use URI::URL;

    my $email        = '***@**.com';
    my $password     = '***';
    my $domain       = 'renren.com';
    my $hostid       = '';
    my $requestToken = '';
    my $rtk          = '';
    my $channel      = 'renren';

    my $ua = LWP::UserAgent->new;
    $ua->agent("Mozilla/5.0 (Windows NT 6.1; rv:30.0) Gecko/20100101 Firefox/30.0");
    my $cookie_jar = HTTP::Cookies->new(
        file           => 'lwp_cookies.txt',
        autosave       => 1,
        ignore_discard => 1);
    $ua->cookie_jar($cookie_jar);

    my $login_url = 'http://www.renren.com/plogin.do';
    # This does not check whether a verification code is required;
    # after reading this article you will know how to add that.
    # Renren logs in via POST: the first argument is the login URL,
    # the second is an anonymous hash of form fields.
    my $res = $ua->post($login_url, {
        'email'    => $email,
        'password' => $password,
        'domain'   => $domain});

    my $homepage;
    # Check the Location response header to see whether the login succeeded.
    if (($res->header('Location') // '') eq 'http://www.renren.com/Home.do') {
        print 'Login ok...', "\n";
        $homepage = $ua->get('http://www.renren.com/home');
    } else {
        exit;
    }

    # As a bonus, the code below posts a status update. It is not commented.
    ####################################
    if ($homepage->is_success) {
        my $pagect = $homepage->content;
        ($hostid)       = $pagect =~ /id\s:\s"(\d+)"/;
        ($requestToken) = $pagect =~ /requestToken\s:\s'(.+)'/;
        ($rtk)          = $pagect =~ /_rtk\s:\s'(.+)'/;
    } else {
        exit;
    }

    my $purl = 'http://shell.renren.com/' . $hostid . '/status';
    my ($sec, $min, $hour, $day, $mon, $year, $wday, $yday, $isdst) = localtime();
    $year += 1900;
    $mon++;
    my $postret = $ua->post($purl, {
        'content'      => "Renren test,by perl script,author:tkorays,date:$year-$mon-$day $hour:$min:$sec.",
        'hostid'       => $hostid,
        'requestToken' => $requestToken,
        '_rtk'         => $rtk,
        'channel'      => $channel});
    if ($postret->is_success) {
        print 'Send ok...', "\n";
    } else {
        print 'Send failed!', "\n";
    }
What if there is a verification code? Just fetch it with the UserAgent's get method.
Here is a simple example:
    my $res = $ua->get($url . '/genimg');
    if (!$res->is_success) {
        return 0;
    }
    # Write the image in binary mode so the bytes are not altered.
    open(my $fh, '>', 'img.jpg') or return 0;
    binmode $fh;
    print $fh $res->content;
    close $fh;
The code above saves the verification code as an image, so the verification code problem is solved.
GO
Now that the problem is solved, why not get started right away?