How to Use Python to Crawl a Website That Requires Login

Last Update:2018-01-19 Source: Internet

Author: User


Original author: Tzahi Vidas. Translation: Bole Online (ebigear)

http://python.jobbole.com/83588/

Recently, I had to crawl some pages from a website that requires login. It was not as simple as I expected, so I decided to write a tutorial about it.

In this tutorial, we will scrape a list of projects from our Bitbucket account.

The code from this tutorial can be found on my GitHub.

We will follow these steps:

1. Extract the details needed to log in
2. Log in to the site
3. Scrape the required data

In this tutorial, I used the following packages (they are listed in requirements.txt):

requests

lxml

Step 1: Study the website's login page

Go to the page "bitbucket.org/account/signin". You will see the sign-in page (log out first if you are already logged in).

Look carefully at the details we need to extract in order to log in.

In this section, we will build a dictionary that holds the details needed to perform the login:

1. Right-click the "Username or email" field and select "Inspect element". We need the value of the "name" attribute of this input box, which is "username". "username" will be the dictionary key, and our user name or email address will be the corresponding value (on other websites, the key may be "email", "user_name", "login", and so on).

2. Right-click the "Password" field and select "Inspect element". In the script, we need the value of the "name" attribute of this input box, which is "password". "password" will be the dictionary key, and the password we enter will be the corresponding value (on other websites, the key may be "userPassword", "loginPassword", "pwd", etc.).

3. In the page source, find a hidden input tag named "csrfmiddlewaretoken". "csrfmiddlewaretoken" will be the key, and the corresponding value will be the value of the hidden input (on other websites, the hidden input may be called "csrfToken", "AuthenticationToken", and so on). An example value is "Vy00PE3Ra6aISwKBrPn72SFml00IcUV8".

Finally, we will get a dictionary like this:

payload = {
    "username": "<USER NAME>",
    "password": "<PASSWORD>",
    "csrfmiddlewaretoken": "<CSRF_TOKEN>"
}

Keep in mind that this is specific to this website. This login form is simple, but for other websites you may need to check the browser's request log to find the keys and values used in the login step.
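To discover which keys a login form expects without clicking through each field, one option is to list every input tag on the page. The sketch below uses the standard-library html.parser module rather than lxml, purely so that it runs without extra packages. SAMPLE_FORM is a hypothetical form modeled on the three fields described above (its action attribute is made up), and FormFieldCollector is an illustrative name, not part of any library.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect the name/value pair of every <input> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:
            # Inputs without a value attribute are recorded as empty strings.
            self.fields[name] = attributes.get("value", "")


# Hypothetical login form, modeled on the fields described in the text.
SAMPLE_FORM = """
<form method="post" action="/account/signin/">
  <input type="hidden" name="csrfmiddlewaretoken" value="Vy00PE3Ra6aISwKBrPn72SFml00IcUV8">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""

collector = FormFieldCollector()
collector.feed(SAMPLE_FORM)
form_fields = collector.fields

for field_name, field_value in form_fields.items():
    print(field_name, "=", repr(field_value))
```

The names printed here are the candidate keys for the payload dictionary. Hidden inputs that already carry a value, such as the CSRF token, usually have to be sent back unchanged.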

Step 2: Log in to the website

For this script, we only need the following imports:

import requests
from lxml import html

First, we create a session object. This object lets us persist the login session across all our requests.

session_requests = requests.session()
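The point of the session object is that cookies stored on it are attached to every later request. The sketch below illustrates this without touching the network: it stores a cookie by hand and then only prepares a request, without sending it. The cookie name, its value, and the example.com URL are made up for the illustration; in the real flow the server sets the cookie after login.

```python
import requests

session = requests.session()

# Hypothetical cookie; normally the server sets it after a successful login.
session.cookies.set("sessionid", "abc123", domain="example.com")

# Build the request through the session but do not send it.
request = requests.Request("GET", "https://example.com/dashboard")
prepared = session.prepare_request(request)

# The session has attached its stored cookie to the outgoing request.
cookie_header = prepared.headers.get("Cookie", "")
print(cookie_header)
```

A plain requests.get() call has no such memory, which is why every request in this tutorial goes through the same session object.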

Second, we extract the CSRF token from the login page, since it is required for login. In this example we use lxml and XPath, but regular expressions or any other method would also work.

login_url = "https://bitbucket.org/account/signin/?next=/"

# Fetch the login page through the session so its cookies are kept.
result = session_requests.get(login_url)

# Parse the page and read the value of the hidden CSRF input.
tree = html.fromstring(result.text)
authenticity_token = list(set(tree.xpath("//input[@name='csrfmiddlewaretoken']/@value")))[0]
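As the text notes, a regular expression can be used instead of XPath. Here is a minimal sketch of that alternative, assuming the name attribute appears before the value attribute inside the input tag (a different attribute order would need a different pattern). SAMPLE_PAGE is a hypothetical fragment standing in for the login page text, and extract_csrf_token is an illustrative helper name.

```python
import re

# Hypothetical fragment standing in for the text of the login page.
SAMPLE_PAGE = (
    '<form method="post">'
    '<input type="hidden" name="csrfmiddlewaretoken" '
    'value="Vy00PE3Ra6aISwKBrPn72SFml00IcUV8">'
    "</form>"
)

# Assumes name=... comes before value=... within the same <input> tag.
CSRF_PATTERN = re.compile(
    r"""<input[^>]*name=['"]csrfmiddlewaretoken['"][^>]*value=['"]([^'"]+)['"]"""
)


def extract_csrf_token(page_html):
    """Return the CSRF token found in page_html, or raise ValueError."""
    match = CSRF_PATTERN.search(page_html)
    if match is None:
        raise ValueError("csrfmiddlewaretoken input not found")
    return match.group(1)


token = extract_csrf_token(SAMPLE_PAGE)

# Fill the payload from Step 1 with the extracted token.
payload = {
    "username": "<USER NAME>",
    "password": "<PASSWORD>",
    "csrfmiddlewaretoken": token,
}
print(payload["csrfmiddlewaretoken"])
```

A regular expression is more brittle than a parser when the markup changes, so the XPath approach is generally the safer default.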

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
