How to use Python to crawl a website that requires login




By Tzahi Vidas. Compiled by Bole Online (ebigear)

http://python.jobbole.com/83588/


Recently, I had to crawl some pages from a website that requires login. It was not as simple as I expected, so I decided to write a short tutorial about it.

In this tutorial, we will crawl a project list from our Bitbucket account.

The code in the tutorial can be found on my GitHub.

We will follow these steps:

  1. Extract the details required for login

  2. Log in to the site

  3. Crawl the required data


In this tutorial, I used the following packages (they are listed in requirements.txt):

requests

lxml

Step 1: Study the website's login page

Go to the following page: "bitbucket.org/account/signin". You will see the login form (log out first if you are already logged in).

Check which details we need to extract in order to log in

In this section, we will build a dictionary that holds the details needed to perform the login:

1. Right-click the "Username or email" field and select "Inspect element". We will use the value of this input's "name" attribute, which is "username". "username" will be the key, and our user name or email address will be the corresponding value (on other websites the key might be "email", "user_name", "login", and so on).


2. Right-click the "Password" field and select "Inspect element". In the script we need the value of this input's "name" attribute, which is "password". "password" will be the dictionary key, and the password we enter will be the corresponding value (on other websites the key might be "userPassword", "loginPassword", "pwd", and so on).


3. In the page source, find a hidden input tag named "csrfmiddlewaretoken". "csrfmiddlewaretoken" will be the key, and the hidden input's value will be the corresponding value (on other websites the hidden input might be called "csrfToken", "AuthenticationToken", and so on). An example value is "Vy00PE3Ra6aISwKBrPn72SFml00IcUV8".


In the end, we get a dictionary like this:


payload = {
    "username": "<USER NAME>",
    "password": "<PASSWORD>",
    "csrfmiddlewaretoken": "<CSRF_TOKEN>"
}


Remember, this is specific to this website. This login form is simple, but other websites may require us to check the browser's request log and find the relevant keys and values to use in the login step.
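As a complement to inspecting the form by hand, the input names can also be listed programmatically. The following is only a rough sketch: it uses the standard library's html.parser instead of lxml, and the InputFieldCollector class, the list_input_fields helper, and the sample form are hypothetical names and data invented for this example, not part of the original tutorial.

```python
from html.parser import HTMLParser


class InputFieldCollector(HTMLParser):
    """Collect the name/value pairs of every named <input> tag in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:
            # Inputs without a value attribute are recorded as empty strings.
            self.fields[name] = attributes.get("value") or ""


def list_input_fields(page_source):
    """Return a dict mapping input names to their current values."""
    collector = InputFieldCollector()
    collector.feed(page_source)
    return collector.fields


# Hypothetical login form, written only for this example.
SAMPLE_FORM = """
<form method="post" action="/account/signin/">
  <input type="hidden" name="csrfmiddlewaretoken" value="Vy00PE3Ra6aISwKBrPn72SFml00IcUV8">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""

fields = list_input_fields(SAMPLE_FORM)
print(fields)
```

Hidden inputs that already carry a value, such as the CSRF token here, usually have to be sent back unchanged, while the empty ones are the fields we fill in ourselves.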

Step 2: Log in to the website

For this script, we only need to import the following:

import requests
from lxml import html


First, we create a session object. This object lets us persist the login session across all of our requests.


session_requests = requests.session()
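To make the role of the session more concrete, here is a small sketch of how a cookie stored on the session is attached to later requests. It runs without network access because the request is only prepared, never sent. The cookie name "sessionid", its value, and the example.com URL are made up for illustration; a real login response would set whatever cookies the site uses.

```python
import requests

session = requests.session()

# Stand-in for a cookie that a successful login response would normally set.
session.cookies.set("sessionid", "abc123")

# prepare_request() builds the outgoing request without sending it,
# merging in the cookies currently stored on the session.
request = requests.Request("GET", "https://example.com/dashboard")
prepared = session.prepare_request(request)

cookie_header = prepared.headers.get("Cookie")
print(cookie_header)
```

A plain requests.get() call starts without any stored cookies each time, which is why the login and the later page requests should all go through the same session object.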


Second, we extract the CSRF token from the login page, since it is needed for the login. In this example we use lxml and XPath to extract it, but regular expressions or any other method would work as well.


login_url = "https://bitbucket.org/account/signin?next=/"
result = session_requests.get(login_url)

tree = html.fromstring(result.text)
authenticity_token = list(set(tree.xpath("//input[@name='csrfmiddlewaretoken']/@value")))[0]
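The token extraction can be tried offline against a saved or hand-written page before pointing it at the live site. The sketch below assumes a simplified, hypothetical login page; the extract_csrf_token helper and its field_name parameter are names invented for this example. It also raises an explicit error when the field is missing instead of failing with an IndexError.

```python
from lxml import html

# Hypothetical, simplified login page used only for this example.
SAMPLE_PAGE = """
<html>
  <body>
    <form method="post">
      <input type="hidden" name="csrfmiddlewaretoken" value="Vy00PE3Ra6aISwKBrPn72SFml00IcUV8">
      <input type="text" name="username">
      <input type="password" name="password">
    </form>
  </body>
</html>
"""


def extract_csrf_token(page_source, field_name="csrfmiddlewaretoken"):
    """Return the value of the first input called field_name."""
    tree = html.fromstring(page_source)
    # $name is an XPath variable that lxml fills from the keyword argument.
    values = tree.xpath("//input[@name=$name]/@value", name=field_name)
    if not values:
        raise ValueError("no input named %r found in the page" % field_name)
    return values[0]


token = extract_csrf_token(SAMPLE_PAGE)

# The token then takes the place of the <CSRF_TOKEN> placeholder from step 1.
payload = {
    "username": "<USER NAME>",
    "password": "<PASSWORD>",
    "csrfmiddlewaretoken": token,
}
print(payload["csrfmiddlewaretoken"])
```

Taking the first match directly, rather than going through list(set(...)), keeps the result predictable if a page ever contains more than one distinct token value, because sets do not preserve order.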

