Take Douban as an example: the page https://www.douban.com/contacts/list lists the people you follow, and you must be logged in to view it.
If you fetch this URL with requests.get() without logging in, all you get back is the login page, so the Python script has to appear logged in to the site before it can crawl the page we want.
An easy way to do this is to log in with your browser and then, in the browser's developer tools (Chrome, for example), look up the Cookie and User-Agent request headers that your own session sends. Python then sends its request with these copied headers in place of the default ones. That has the same effect as logging in: the server assumes the request comes from the logged-in user.
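The Cookie value copied from the browser is one long string of `name=value` pairs separated by semicolons. As a minimal sketch (the helper name `parse_cookie_header` is hypothetical, not part of Requests), it can be split into a dictionary so that individual cookies are easier to inspect or replace:

```python
def parse_cookie_header(raw: str) -> dict:
    """Split a raw Cookie header string into a {name: value} dict.

    Hypothetical helper for illustration; it assumes the usual
    'name=value; name=value' layout copied from the browser.
    """
    cookies = {}
    for pair in raw.split(";"):
        # Split on the first '=' only, because values may contain '=' too.
        name, sep, value = pair.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


if __name__ == "__main__":
    # Shortened sample values, for illustration only.
    sample = 'bid=vpypmmd30-k; ll="118282"; ck=13ey'
    print(parse_cookie_header(sample))
```

The resulting dictionary can be passed to requests.get() through its `cookies=` argument instead of writing the whole string into a 'Cookie' header by hand.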
The code for sending the request with the copied headers is as follows:
import requests

# Headers copied from a logged-in browser session; substitute your own values.
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36',
    # The HTTP header name is 'Cookie' (singular); pairs are separated by '; '.
    'Cookie': 'gr_user_id=1f9ea7ea-462a-4a6f-9d55-156631fc6d45; bid=vpypmmd30-k; ll="118282"; ue="Codin; __utmz=30149280.1499577720.27.14.utmcsr=douban.com|utmccn=(referral)|utmcmd=referral|utmcct=/doulist/240962/; __utmv=30149280.3049; _VWO_UUID_V2=F04099A9DD; viewed="27607246_26356432"; ap=1; ps=y; push_noty_num=0; push_doumail_num=0; dbcl2="30496987:gzxpftzw4y0"; ck=13ey; _pk_ref.100001.8cb4=%5b%22%22%2c%22%22%2c1515153574%2c%22https%3a%2f%2fbook.douban.com%2fmine%22%5d; __utma=30149280.833870293.1473539740.1514800523.1515153574.50; __utmc=30149280; _pk_id.100001.8cb4=255d8377ad92c57e.1473520329.20.1515153606.1514628010.',
}

# Send the request with the browser's headers so the server treats it as the logged-in user.
r = requests.get('https://www.douban.com/contacts/list', headers=headers)
print(r.text)
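Because an expired or mistyped cookie simply yields the login page again, it can help to check where the request actually ended up. The sketch below is only an assumption about how such a check might look: the function name `looks_logged_in` is hypothetical, and it assumes the site redirects unauthenticated requests to a different host or path, which may not hold for every site.

```python
from urllib.parse import urlparse


def looks_logged_in(requested_url: str, final_url: str, status_code: int) -> bool:
    """Guess whether the response is the page we asked for.

    Hypothetical check: assumes an unauthenticated request is redirected
    to some other host or path (such as a login page).
    """
    if status_code != 200:
        return False
    wanted, got = urlparse(requested_url), urlparse(final_url)
    # Same host and same path (ignoring a trailing slash) means no redirect away.
    return (wanted.netloc == got.netloc
            and wanted.path.rstrip("/") == got.path.rstrip("/"))
```

With the request above, this would be called as looks_logged_in('https://www.douban.com/contacts/list', r.url, r.status_code), since r.url holds the final URL after any redirects.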
Python crawler: simulating a browser login by replacing the HTTP request headers