I have been learning to write crawlers for a while, and recently I worked on a data crawling and analysis project for Lagou (lagou.com). The project is nearly finished, so I am taking some time to write up a few of the problems I ran into.
Lagou's current anti-crawler mechanism is reasonably effective. When I first analyzed the site with scrapy shell, I found that Lagou checks the User-Agent header, and that after a few requests it redirects to the login page. In other words, Lagou also checks cookies.
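A crawler needs some way to notice that it has been bounced to the login page. Here is a minimal sketch that checks the final URL of a response. The helper name redirected_to_login and the second sample URL are my own assumptions for illustration; the host and path come from the login URL quoted in this post.

```python
from urllib.parse import urlsplit

# Host and path prefix of the login page, taken from the login URL in this post.
LOGIN_HOST = "passport.lagou.com"
LOGIN_PATH_PREFIX = "/login"


def redirected_to_login(final_url):
    """Return True if final_url points at the Lagou login page."""
    parts = urlsplit(final_url)
    return parts.hostname == LOGIN_HOST and parts.path.startswith(LOGIN_PATH_PREFIX)


if __name__ == "__main__":
    # The second URL is a hypothetical listing page, used only as a contrast.
    print(redirected_to_login("https://passport.lagou.com/login/login.html"))
    print(redirected_to_login("https://www.lagou.com/jobs/list_python"))
```

With requests, the URL to pass in would be response.url after redirects have been followed.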
Here is my approach to simulating the login:
Lagou login page: https://passport.lagou.com/login/login.html
Capture the login request and analyze it.
From the capture we can work out the parameters needed to simulate the login:
url = "https://passport.lagou.com/login/login.html"
postData = {
    'isValidate': 'true',
    'username': username,
    'password': password,
    'request_form_verifyCode': '',
    'submit': ''
}
HEADERS = {
    'Referer': 'https://passport.lagou.com/login/login.html',
    'User-Agent': '',
    'X-Requested-With': 'XMLHttpRequest',
    'X-Anit-Forge-Token': '',
    'X-Anit-Forge-Code': '',
}
So how do we get the two parameters X-Anit-Forge-Token and X-Anit-Forge-Code?
Open the developer tools with F12 and look carefully at the source of the login page.
Both values are in the head tag, so a regular expression is enough to extract them.
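To try the regular expressions without hitting the site, here is a small offline sketch. The SAMPLE_HTML snippet and its values are made up for illustration (including the window. prefix), and the helper name extract_forge_values is my own; only the variable names X_Anti_Forge_Token and X_Anti_Forge_Code follow what the login page uses.

```python
import re

# Hypothetical excerpt of the login page's <head>; real values change per request.
SAMPLE_HTML = """
<head>
<script>
    window.X_Anti_Forge_Token = 'abc-123';
    window.X_Anti_Forge_Code = '45678';
</script>
</head>
"""

TOKEN_RE = re.compile(r"X_Anti_Forge_Token\s*=\s*'(.*?)'")
CODE_RE = re.compile(r"X_Anti_Forge_Code\s*=\s*'(\d+)'")


def extract_forge_values(html):
    """Return (token, code); each is '' when it is not found in html."""
    token = TOKEN_RE.search(html)
    code = CODE_RE.search(html)
    return (token.group(1) if token else "", code.group(1) if code else "")


if __name__ == "__main__":
    print(extract_forge_values(SAMPLE_HTML))
```

Using two separate searches instead of one long pattern means the order of the two assignments in the page does not matter.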
So how is the login password encrypted?
The page source shows that the login page does not load many JS files, so we can go through them one by one.
In main.html_aio_f95e644.js (https://img.lagou.com/passport/static/pkg/pc/page/login/main.html_aio_f95e644.js)
I found the encryption method:
First, hash the password with MD5: password = md5(password)
Then wrap the result in the string veenike: password = "veenike" + password + "veenike"
Finally, hash it with MD5 again: password = md5(password)
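To make the three steps easier to check against what the browser sends, here is a rough sketch that returns every intermediate value. The helper names md5_hex and hash_password_steps and the sample password are my own; the salt string is the one found in the JS file above.

```python
import hashlib

SALT = "veenike"  # fixed string found in the login page's JS, per the analysis above


def md5_hex(text):
    """Return the hex MD5 digest of a UTF-8 string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def hash_password_steps(password):
    """Return (first_md5, salted_string, final_md5) for a plain-text password."""
    first = md5_hex(password)        # step 1: MD5 of the plain password
    salted = SALT + first + SALT     # step 2: wrap the digest in the salt
    final = md5_hex(salted)          # step 3: MD5 of the wrapped string
    return first, salted, final


if __name__ == "__main__":
    # "example-password" is a placeholder, not a real credential.
    for label, value in zip(("first", "salted", "final"),
                            hash_password_steps("example-password")):
        print(label, value)
```

The final value is what goes into the password field of the POST data.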
That completes the analysis. Here is the code that simulates the login.
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import requests
import hashlib
import re

# Session object shared by all requests, so cookies are kept between them
session = requests.session()

# Request headers
HEADERS = {
    'Referer': 'https://passport.lagou.com/login/login.html',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:51.0) Gecko/20100101 Firefox/51.0',
}


def get_password(passwd):
    '''The password is hashed with MD5 twice. The value veenike was found in main.html_aio_f95e644.js.'''
    passwd = hashlib.md5(passwd.encode('utf-8')).hexdigest()
    passwd = 'veenike' + passwd + 'veenike'
    passwd = hashlib.md5(passwd.encode('utf-8')).hexdigest()
    return passwd


def get_token():
    # Fetch the login page and pull both anti-forgery values out of its source
    forge_token = ""
    forge_code = ""
    login_page = 'https://passport.lagou.com/login/login.html'
    data = session.get(login_page, headers=HEADERS)
    match_obj = re.match(r'.*X_Anti_Forge_Token = \'(.*?)\';.*X_Anti_Forge_Code = \'(\d+?)\'', data.text, re.DOTALL)
    if match_obj:
        forge_token = match_obj.group(1)
        forge_code = match_obj.group(2)
    return forge_token, forge_code


def login(username, passwd):
    x_anti_forge_token, x_anti_forge_code = get_token()
    login_headers = HEADERS.copy()
    login_headers.update({
        'X-Requested-With': 'XMLHttpRequest',
        'X-Anit-Forge-Token': x_anti_forge_token,
        'X-Anit-Forge-Code': x_anti_forge_code,
    })
    postData = {
        'isValidate': 'true',
        'username': username,
        'password': get_password(passwd),
        'request_form_verifyCode': '',
        'submit': '',
    }
    response = session.post('https://passport.lagou.com/login/login.json', data=postData, headers=login_headers)
    print(response.text)


def get_cookies():
    # Cookies collected by the session, as a plain dict
    return requests.utils.dict_from_cookiejar(session.cookies)


if __name__ == "__main__":
    username = '1371XXXXXXX'
    passwd = 'xxxxxxxxxx'
    login(username, passwd)
    print(get_cookies())
[Screenshot: console output of the login script]
Once the simulated login succeeds, you have the cookies and can start on the crawler. The code should run as copied, but Lagou may change its login checks at any time. If the login stops working, leave a comment below and I will find time to update the code.
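If the crawler uses a different HTTP client than the login script, the cookie dict can be turned into a plain Cookie header. This is just a sketch: the helper name cookie_header and the cookie names in the sample are hypothetical, not the actual cookies Lagou sets.

```python
def cookie_header(cookies):
    """Join a {name: value} dict into the value of an HTTP Cookie header."""
    return "; ".join("{}={}".format(name, value) for name, value in cookies.items())


if __name__ == "__main__":
    # Hypothetical cookie names and values, for illustration only.
    sample = {"session_id": "abc", "trace_token": "xyz"}
    print(cookie_header(sample))
```

Clients that accept a cookie dict directly do not need this step.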
GitHub address: https://github.com/laichilueng/lagou_login
If you find it useful, feel free to give it a star.