Chapter 4: Crawling a well-known Q&A website with Scrapy



Chapter 5 seems to be little more than a practice project for Chapter 4, with simulated login added.

These notes are grouped by topic, and knowledge points were added as they came up, so they may be a little messy.

1. Common HTTP status codes:
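The author's own list of codes is not preserved in these notes. As a stand-in, the sketch below prints a handful of status codes a crawler tends to run into, using the standard library's `http.HTTPStatus` table. The selection in `COMMON_CODES` is an assumption made for illustration, not the original list.

```python
from http import HTTPStatus

# Codes picked for illustration; an assumed selection, not the author's list.
COMMON_CODES = [200, 301, 302, 403, 404, 500, 503]


def describe(code):
    """Return 'code phrase' using the standard library's status table."""
    return "{0} {1}".format(code, HTTPStatus(code).phrase)


descriptions = [describe(code) for code in COMMON_CODES]
for line in descriptions:
    print(line)
```

For login checks, the 200 versus 301/302 distinction is the one that matters most: a redirect away from a protected page usually means the session is not logged in.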

2. How to find the POST parameters?

First, open the login page with Firebug running, submit a wrong account and password, and watch the resulting POST request: it shows the URL the form is posted to (post_url) and the parameters sent with it.

3. Read local files and generate cookies

    try:
        import cookielib  # Python 2
    except ImportError:
        import http.cookiejar as cookielib  # Python 3
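The import alone does not show how cookies end up in a local file and come back out of it. The following is a minimal Python 3 sketch of that round trip using `LWPCookieJar`. The cookie name, value, domain, and the `make_cookie` helper are hypothetical, and the file goes to a temporary directory rather than a real `cookies.txt`.

```python
import http.cookiejar as cookielib
import os
import tempfile


def make_cookie(name, value, domain):
    """Build a session cookie (no expiry) purely for demonstration."""
    return cookielib.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


cookie_path = os.path.join(tempfile.mkdtemp(), "cookies.txt")

# Save: ignore_discard=True keeps session cookies, which are otherwise skipped.
jar = cookielib.LWPCookieJar(filename=cookie_path)
jar.set_cookie(make_cookie("sessionid", "abc123", "example.com"))
jar.save(ignore_discard=True)

# Load: a fresh jar reads the same file back.
restored = cookielib.LWPCookieJar(filename=cookie_path)
restored.load(ignore_discard=True)
restored_values = {c.name: c.value for c in restored}
print(restored_values)
```

Session cookies carry no expiry date and are flagged as discardable, so both `save` and `load` need `ignore_discard=True` for them to survive the round trip.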

4. Log in to Zhihu with requests

    # -*- coding: utf-8 -*-
    __author__ = 'jinxiao'

    import requests
    try:
        import cookielib
    except ImportError:
        import http.cookiejar as cookielib

    import re

    session = requests.session()  # create a session; use it in place of requests for every call below
    session.cookies = cookielib.LWPCookieJar(filename="cookies.txt")  # cookie jar that can be saved to a file
    # Load previously saved cookies
    try:
        session.cookies.load(ignore_discard=True)
    except Exception:
        print("Cookie failed to load")

    # Zhihu requires browser-like headers; other sites may not
    agent = "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0"
    header = {
        "HOST": "www.zhihu.com",
        "Referer": "https://www.zhihu.com",
        "User-Agent": agent
    }

    def is_login():
        # Determine login status from the status code returned by the personal center page
        inbox_url = "https://www.zhihu.com/question/56250357/answer/148534773"
        # Disable redirects so that a redirect to the login page is not followed silently
        response = session.get(inbox_url, headers=header, allow_redirects=False)
        if response.status_code != 200:
            return False
        else:
            return True

    def get_xsrf():
        # Get the xsrf code; re.DOTALL lets ".*" span the lines of the page
        response = session.get("https://www.zhihu.com", headers=header)
        match_obj = re.match('.*name="_xsrf" value="(.*?)"', response.text, re.DOTALL)
        if match_obj:
            return match_obj.group(1)
        else:
            return ""


    def get_index():
        response = session.get("https://www.zhihu.com", headers=header)
        with open("index_page.html", "wb") as f:
            f.write(response.text.encode("utf-8"))
        print("ok")

    def zhihu_login(account, password):
        # Log in to Zhihu
        if re.match(r"^1\d{10}", account):
            print("phone number login")
            post_url = "https://www.zhihu.com/login/phone_num"
            post_data = {
                "_xsrf": get_xsrf(),
                "phone_num": account,
                "password": password
            }
        elif "@" in account:
            # The account is an email address
            print("Mailbox login")
            post_url = "https://www.zhihu.com/login/email"
            post_data = {
                "_xsrf": get_xsrf(),
                "email": account,
                "password": password
            }
        else:
            print("Unrecognized account format")
            return

        response_text = session.post(post_url, data=post_data, headers=header)
        session.cookies.save()

    zhihu_login("18782902568", "admin123")
    # get_index()
    print(is_login())
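One detail in `get_xsrf` is easy to trip over: `re.match` anchors at the start of the text and `.` does not cross newlines by default, so a pattern beginning with `.*` only sees the first line of a multi-line page. The sketch below shows the difference on a hypothetical page fragment. The HTML layout and the `extract_xsrf` helper are assumptions made for illustration, not Zhihu's actual markup.

```python
import re

# Hypothetical page fragment; the real page layout is an assumption.
sample_html = """<html>
<body>
<form method="post">
<input type="hidden" name="_xsrf" value="0123abcd"/>
</form>
</body>
</html>"""

pattern = '.*name="_xsrf" value="(.*?)"'

# Without re.DOTALL, "." stops at the first newline, so nothing matches.
single_line_match = re.match(pattern, sample_html)

# With re.DOTALL, ".*" can span lines and reach the hidden input.
multi_line_match = re.match(pattern, sample_html, re.DOTALL)


def extract_xsrf(html):
    """Return the _xsrf value, or an empty string if it is absent."""
    match_obj = re.search(r'name="_xsrf" value="(.*?)"', html)
    return match_obj.group(1) if match_obj else ""


print(single_line_match)
print(multi_line_match.group(1))
print(extract_xsrf(sample_html))
```

Either `re.DOTALL` or `re.search` works. `re.search` avoids the leading `.*` altogether, which is usually the simpler choice.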

5. Add a User-Agent when debugging in the Scrapy shell

    scrapy shell -s USER_AGENT='...' url

6. JsonView plug-in

It renders JSON responses in the browser in a formatted, readable view.

7. Write HTML files

    with open('e:/zhihu.html', "wb") as f:
        f.write(response.text.encode('utf-8'))

8. Understanding yield

If a callback yields an item, the item is passed to the pipelines for processing.

If a callback yields a Request, the request is sent to the downloader to be fetched.
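The routing described above can be sketched in plain Python, without Scrapy itself. This is a toy model only: the `Request` class, `parse`, `dispatch`, and the example.com URLs are hypothetical stand-ins, and Scrapy's real engine does far more (scheduling, deduplication, middlewares).

```python
class Request:
    """Stand-in for a request object; holds only the URL to fetch next."""

    def __init__(self, url):
        self.url = url


def parse(page):
    """Toy callback: yield one item for the page, then a request per link."""
    yield {"title": page["title"]}
    for link in page["links"]:
        yield Request(link)


def dispatch(results):
    """Route each yielded object: requests to the downloader, the rest to pipelines."""
    to_pipeline, to_downloader = [], []
    for obj in results:
        if isinstance(obj, Request):
            to_downloader.append(obj.url)
        else:
            to_pipeline.append(obj)
    return to_pipeline, to_downloader


page = {
    "title": "Question 1",
    "links": ["https://example.com/q/2", "https://example.com/q/3"],
}
items, urls = dispatch(parse(page))
print(items)
print(urls)
```

A single callback can mix both kinds of yield freely. The type of each yielded object decides where it goes.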

9. Deduplicating in MySQL: rely on the primary key and handle primary key conflicts

Solution: append ON DUPLICATE KEY UPDATE content = VALUES(content) to the INSERT statement, where content is the column to be updated.
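As a rough sketch of how such a statement might be assembled before being handed to a database driver, the helper below builds the SQL text with `%s` placeholders. The function name `build_upsert_sql`, the table `zhihu_answer`, and its columns are hypothetical. Only the `ON DUPLICATE KEY UPDATE ... = VALUES(...)` clause comes from the note above.

```python
def build_upsert_sql(table, columns, update_columns):
    """Return an INSERT ... ON DUPLICATE KEY UPDATE statement with %s placeholders.

    Table and column names are interpolated directly, so they must come
    from trusted code, never from user input.
    """
    column_list = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    updates = ", ".join("{0} = VALUES({0})".format(c) for c in update_columns)
    return (
        "INSERT INTO {table} ({cols}) VALUES ({ph}) "
        "ON DUPLICATE KEY UPDATE {updates}"
    ).format(table=table, cols=column_list, ph=placeholders, updates=updates)


# Hypothetical table and columns, for illustration only.
sql = build_upsert_sql("zhihu_answer", ["answer_id", "content"], ["content"])
params = (148534773, "updated answer text")
print(sql)
# With a MySQL DB-API driver this would typically run as: cursor.execute(sql, params)
```

When the primary key already exists, the row is updated in place instead of raising a duplicate-key error, so re-crawling the same answer refreshes its content rather than failing.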

10. Enter the verification code manually (zhihu_login_requests.py)

    def get_captcha():
        import time
        t = str(int(time.time() * 1000))
        captcha_url = "https://www.zhihu.com/captcha.gif?r={0}&type=login".format(t)
        t = session.get(captcha_url, headers=header)
        with open("captcha.jpg", "wb") as f:
            f.write(t.content)
        captcha = input("input verification code:")
        return captcha

Why does the session.get call (the fifth line) use session rather than requests?
Because a plain requests.get call would open a new session that does not match the one used for the later login request, so the verification code you typed would not be the one tied to the current session.

Author: Jin Xiao

Source: http://www.cnblogs.com/jinxiao-pu/p/6749332.html

The copyright of this article is shared by the author and the blog site. Reposting is welcome, but without the author's consent you must retain this statement and provide a link to the original article on the article page.

 
