Python crawler personal notes (IV): using Python to write a diary entry on Douban

Source: Internet
Author: User
Tags: xpath

Related keywords: requests library, requests.post method, cookies, login

I. Purpose

Use cookies to log in to Douban and write a diary entry (note)
https://www.douban.com/note/636142594/

II. Step analysis

1. Log in to Douban with a browser, then capture and analyze the cookies

2. Use the cookies to simulate a Douban login (logging in with account and password also works, but requires a verification code; a cookie usually stays valid for a few days)

3. Analyze how the browser writes a diary entry, and simulate that POST in Python

4. Source code and testing

III. Simulating login in the scrapy shell

1. Log in to Douban with a browser and capture the cookies in Fiddler

The cookie contains many items, and not all of them are required. Testing shows that including 'dbcl2' alone is enough to log in.
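Narrowing a captured Cookie header down to the one required entry can be sketched with plain string handling. This is a minimal sketch, not the author's code: `parse_cookie_header` and `pick_cookies` are hypothetical helper names, and the header string is an illustrative assumption assembled from cookie names and values that appear in this article.

```python
def parse_cookie_header(header):
    """Split a raw Cookie header ("a=1; b=2") into a dict of name -> value."""
    cookies = {}
    for pair in header.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep:  # skip fragments that have no "="
            cookies[name] = value
    return cookies


def pick_cookies(header, wanted=("dbcl2",)):
    """Keep only the cookies whose names are listed in `wanted`."""
    parsed = parse_cookie_header(header)
    return {name: parsed[name] for name in wanted if name in parsed}


# Illustrative header (assumption): a shortened version of what Fiddler would show.
raw = 'll="118201"; bid=Puwfxi53mha; dbcl2="164753551:kjyotngwwii"; ck=Osar'
cookies = pick_cookies(raw)
print(cookies)
```

The resulting dict has the same shape as the `cookies` argument used in the rest of this article, with the surrounding double quotes of the value kept exactly as the browser sent them.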

2. Open the scrapy shell and test the login

Simulate the browser's User-Agent and cookies

$ scrapy shell
...
>>> from scrapy import Request
>>> cookies = {'dbcl2': '"164753551:kjyotngwwii"'}
>>> headers = {'User-Agent': 'Mozilla/5.0'}
>>> req = Request('https://www.douban.com/mine/', headers=headers, cookies=cookies)
>>> fetch(req)
>>> # Use the browser's "Inspect element" to get the XPath (see parts (I) and (II) of this series).
>>> # The diary is not visible to other users, so seeing its content means the simulated login worked.
>>> response.xpath('//*[@id="note_636142594_short"]').extract()
['<div class="note" id="note_636142594_short">hello douban</div>']
>>> response.xpath('//*[@id="note_636142594_short"]/text()').extract()
['hello douban']
>>>

The diary content was retrieved, so the simulated login succeeded and the cookie is usable.
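The same "is the diary visible?" check can be sketched outside the scrapy shell with the standard library only. This is a rough sketch under stated assumptions: `TextById` and `element_text` are hypothetical names, the lowercase element id is assumed, and the parser ignores the special case of void tags such as `<br>` inside the target element.

```python
from html.parser import HTMLParser


class TextById(HTMLParser):
    """Collect the text inside the first element that has the given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.found = False  # True once the target element has been seen
        self.depth = 0      # > 0 while the parser is inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif not self.found and dict(attrs).get("id") == self.target_id:
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def element_text(html, element_id):
    """Return the element's text, or None when the id is absent (e.g. not logged in)."""
    parser = TextById(element_id)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks) if parser.found else None


# Sample HTML taken from the shell session in this article (id casing assumed).
sample = '<div class="note" id="note_636142594_short">hello douban</div>'
print(element_text(sample, "note_636142594_short"))
```

A return value of `None` would suggest that the page was served without the private diary, which is the symptom of an expired or missing cookie.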

IV. Writing a Douban diary entry with Python

1. Write a diary entry in the browser and observe the traffic in Fiddler

The browser turns out to issue the request POST https://www.douban.com/note/create HTTP/1.1

The body of the POST is ck=BSJH&note_id=636142544&note_title=test_2&note_text=hello2&author_tags=&note_privacy=P

ck=BSJH is a value taken from the cookie

note_id=636142544 (presumably an ID tied to the user; copy it as is)

note_title=test_2&note_text=hello2 (the title and the content)

The other parameters are not important; the defaults will do
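The field breakdown above can be reproduced by decoding the captured body with the standard library. This is a minimal sketch: the body string is the one quoted in this section, with field names assumed to be lowercase, and `keep_blank_values=True` is passed so that the empty `author_tags` field is not dropped.

```python
from urllib.parse import parse_qsl

# Form body as captured in Fiddler (field-name casing is an assumption).
body = ("ck=BSJH&note_id=636142544&note_title=test_2"
        "&note_text=hello2&author_tags=&note_privacy=P")

# parse_qsl returns (name, value) pairs; blank values are kept on request.
fields = dict(parse_qsl(body, keep_blank_values=True))

for name, value in fields.items():
    print(f"{name:13} = {value!r}")
```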

2. Simulate the POST in Python

# POST the required parameters

requests.post(url=url, data=data, headers=headers, verify=False, cookies=cookies)
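To see exactly what such a call would put on the wire without touching the network, the request can be built but not sent. The sketch below uses the standard library's `urllib` as a stand-in for `requests`; `build_note_request` is a hypothetical helper, and passing the cookie as a hand-written `Cookie` header is a simplification of what `requests` does with its `cookies` argument.

```python
from urllib.parse import urlencode
from urllib.request import Request

NOTE_CREATE_URL = "https://www.douban.com/note/create"


def build_note_request(title, text, ck, note_id, dbcl2):
    """Build (but do not send) the POST request that creates a diary entry."""
    body = urlencode({
        "ck": ck,
        "note_id": note_id,
        "note_title": title,
        "note_text": text,
    }).encode("utf-8")
    headers = {
        "User-Agent": "Mozilla/5.0",
        "Cookie": f"dbcl2={dbcl2}",
        "Content-Type": "application/x-www-form-urlencoded",
    }
    return Request(NOTE_CREATE_URL, data=body, headers=headers, method="POST")


# Values copied from this article; they are long expired and serve only as examples.
req = build_note_request("Hellopython", "Hellopython", "BSJH",
                         "636142544", '"164753551:kjyotngwwii"')
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
print(req.get_header("Cookie"))
```

Sending the request would then be a matter of passing `req` to `urllib.request.urlopen`, or of making the equivalent `requests.post` call shown above.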

V. Source code and testing
Source code

import requests

### 1. First log in on any page and obtain the cookies

# requests emits a warning when opening HTTPS with verification disabled; this line suppresses it
requests.packages.urllib3.disable_warnings()

headers = dict()
headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.3387.400 QQBrowser/9.6.11984.400'

# Only 'dbcl2' is required; the other captured cookies are left commented out
cookies = {
    # 'll': '"118201"',
    # 'bid': 'Puwfxi53mha',
    # '_ga': 'GA1.2.1759080547.1501749204',
    # '__yadk_uid': 'Rjmlgzyjjuhi5lhnhjx3logbaltgb5xy',
    # 'gr_user_id': '16c2c492-9e32-4af2-9c35-230e8d43db06',
    # 'ps': 'y',
    # '_pk_ref.100001.8cb4': '%5b%22%22%2c%22%22%2c1504529257%2c%22https%3a%2f%2faccounts.douban.com%2flogin%3fredir%3dhttps%253a%252f%252fwww.baidu.com%252flink%253furl%253deh3ngsbwz6s0p2oqc7qhrezckdwjewbljfnbprtrwkv4qwolsccwkcsh9iqfedax%2526wd%253D%2526EQID%253D8191D1C1000627560000000359AD43F4%22%5D',
    # 'ap': '1',
    # '_vwo_uuid_v2': '57D26B154CE7E363177CFD5F35F06F34|E63FA1BFE4C07598B6454AE2A97166CB',
    'dbcl2': '"164753551:kjyotngwwii"',
    # 'ck': 'Osar',
    # '_pk_id.100001.8cb4': '70e88acbc88cb16d.1501749196.11.1504530290.1504527380.',
    # '_pk_ses.100001.8cb4': '*',
    # 'push_noty_num': '0',
    # 'push_doumail_num': '0',
    # '__utma': '30149280.1759080547.1501749204.1504529257.1504530054.20',
    # '__utmb': '30149280.5.10.1504530054',
    # '__utmc': '30149280',
    # '__utmz': '30149280.1504530054.20.16.UTMCSR',
    # '__utmv': '30149280.16475'
}

data = {
    'ck': 'BSJH',
    'note_id': '636142544',
    'note_title': 'Hellopython',
    'note_text': 'Hellopython',
    # 'author_tags': '',
    # 'note_privacy': 'P'
}
url = 'https://www.douban.com/note/create'
# Note: add verify=False when accessing the HTTPS link, otherwise an error is returned
ret = requests.post(url=url,
                    data=data,
                    headers=headers,
                    verify=False,
                    cookies=cookies)
print(ret.text)  # the original printed only a leading slice; its length is not recoverable here
print(ret.cookies.get_dict())

Test results

Done!

VI. Summary and analysis

1. Using cookies this time sidestepped the verification code; cracking the verification code is something to study next time

2. A cookie is only valid for a limited time and has to be replaced after a while

3. requests verifies HTTPS certificates by default; in this setup verify=False had to be added, which in turn produces a warning that needs to be suppressed

# requests emits a warning when opening HTTPS with verification disabled; this line suppresses it
requests.packages.urllib3.disable_warnings()
