How to Use Python to Crawl a Website That Requires Login
Author: Tzahi Vidas. Translation: Bole Online (Jobbole), ebigear
http://python.jobbole.com/83588/
Recently I had to crawl some pages from a website that requires login. It was not as simple as I expected, so I decided to write a tutorial about it.
In this tutorial, we will crawl the project list from our Bitbucket account.
The code for the tutorial can be found on my GitHub.
We will follow these steps:
1. Extract the details required for login
2. Log in to the site
3. Crawl the required data
In this tutorial, I used the following packages (which can be found in requirements.txt):
requests
lxml
Step 1: Study the website's login page
Go to the following page: "bitbucket.org/account/signin". You will see the login page (log out first if you are already logged in).
Check the details we need to extract for login
In this section, we will build a dictionary that holds the details needed to perform the login:
1. Right-click the "Username or email" field and select "Inspect element". We will use the "name" attribute of this input box, which is "username". "username" will be the key, and our user name or email address will be the value (on other websites the key might be "email", "user_name", "login", and so on).
2. Right-click the "Password" field and select "Inspect element". In the script we need the "name" attribute of this input box, which is "password". "password" will be the dictionary key, and the password we enter will be the value (on other websites the key might be "userPassword", "loginPassword", "pwd", and so on).
3. In the page source, find the hidden input tag named "csrfmiddlewaretoken". "csrfmiddlewaretoken" will be the key, and the value will be the hidden input's value (on other websites it might be called "csrfToken" or "AuthenticationToken"). An example value is "Vy00PE3Ra6aISwKBrPn72SFml00IcUV8".
In the end, we get a dictionary like this:
payload = {
    "username": "<USER NAME>",
    "password": "<PASSWORD>",
    "csrfmiddlewaretoken": "<CSRF_TOKEN>"
}
Keep in mind that this is specific to this website. This login form is simple, but on other websites we may need to check the browser's request log to find the keys and values that the login step should send.
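When a form's fields are less obvious, one option is to list its input names programmatically before building the payload. The sketch below is only an illustration: it uses the standard library's html.parser rather than lxml so that it runs without extra packages, and SAMPLE_FORM and InputCollector are made-up names. The markup is not Bitbucket's real page.

```python
from html.parser import HTMLParser

# Hypothetical login form, for illustration only; real pages will differ.
SAMPLE_FORM = """
<form method="post" action="/account/signin/">
  <input type="hidden" name="csrfmiddlewaretoken" value="Vy00PE3Ra6aISwKBrPn72SFml00IcUV8">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="submit" value="Log in">
</form>
"""


class InputCollector(HTMLParser):
    """Collect the name and value of every named <input> tag."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:  # skip unnamed inputs such as the submit button
            self.fields[name] = attributes.get("value", "")


collector = InputCollector()
collector.feed(SAMPLE_FORM)
form_fields = collector.fields
print(form_fields)
```

The keys of form_fields are the candidates for the payload dictionary. Fields that come back with an empty value are the ones we fill in ourselves.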
Step 2: Log in to the website
For this script, we only need to import the following:
import requests
from lxml import html
First, we create a session object. This object lets us persist the login session across all our requests.
session_requests = requests.session()
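To see what persisting the session means in practice, here is a small offline sketch. It assumes a cookie named "sessionid" (a hypothetical name and value) was stored on the session, then prepares a follow-up request without sending it. The example.com URL is a placeholder.

```python
import requests

session_requests = requests.session()

# Pretend the server set this cookie after a successful login.
# The cookie name and value are hypothetical.
session_requests.cookies.set("sessionid", "abc123")

# Build (but do not send) a follow-up request through the same session.
request = requests.Request("GET", "https://example.com/dashboard")
prepared = session_requests.prepare_request(request)

# The session attaches its stored cookies to the prepared request.
cookie_header = prepared.headers.get("Cookie")
print(cookie_header)
```

Because every request goes through the same session object, cookies received at login are sent again automatically on later requests.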
Second, we extract the CSRF token used for login from the web page. In this example we use lxml and XPath; we could also use regular expressions or any other method to extract this data.
login_url = "https://bitbucket.org/account/signin?next=/"
result = session_requests.get(login_url)
tree = html.fromstring(result.text)
authenticity_token = list(set(tree.xpath("//input[@name='csrfmiddlewaretoken']/@value")))[0]
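The regular-expression alternative mentioned above could look roughly like the following sketch. SAMPLE_HTML, TOKEN_PATTERN and extract_csrf_token are hypothetical names used only for illustration, and the pattern assumes the name attribute appears before the value attribute inside the tag. Regular expressions are brittle on real HTML, so treat this as a fallback rather than a replacement for the XPath approach.

```python
import re

# Hypothetical fragment of a login page; real markup may differ.
SAMPLE_HTML = (
    '<form method="post">'
    "<input type='hidden' name='csrfmiddlewaretoken' "
    "value='Vy00PE3Ra6aISwKBrPn72SFml00IcUV8' />"
    "</form>"
)

# Match the hidden input by name, then capture its value attribute.
# Assumes name comes before value inside the same tag.
TOKEN_PATTERN = re.compile(
    r"""<input[^>]*name=['"]csrfmiddlewaretoken['"][^>]*value=['"]([^'"]+)['"]"""
)


def extract_csrf_token(page_source):
    """Return the CSRF token found in page_source, or None if absent."""
    match = TOKEN_PATTERN.search(page_source)
    return match.group(1) if match else None


authenticity_token = extract_csrf_token(SAMPLE_HTML)
print(authenticity_token)
```

Returning None when nothing matches lets the caller stop early instead of submitting a login request with a missing token.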