Use requests to download novels

Table of Contents
- 1. Use requests
- 1.1. Official quickstart documentation (latest version; check the version number when using a specific release)
- 1.2. Open a website
- 1.3. Send other requests
- 1.3.1. Use requests.post(url) to send a POST request
- 1.3.2. Other HTTP methods are sent the same way
- 1.3.3. Carrying cookies: get cookies from the browser and pass them via get(url, cookies=cookies)
- 1.4. Check whether the page was fetched correctly, and its encoding
- 1.4.1. requests.codes.ok means success
- 1.5. Get the returned content
- 1.5.1. Get text
- 1.5.2. Get binary content
- 1.5.3. Response JSON
- 1.5.4. Raw response content
- 1.6. Write to file
- 1.7. Parse the content with BeautifulSoup
- 1.8. Use select to find a specified object
- 1.8.1. Use select to get a specified tag; the search syntax is similar to CSS selectors
- 1.8.2. Get the tag name and content
- 2. Examples
- 2.1. Pass key-value pairs to search with Baidu
- 2.2. Send cookies to Weibo
- 2.3. Download the novel "Tao medical world"
1. Use requests
1.1 Official quickstart documentation (latest version; check the version number when using a specific release)
http://docs.python-requests.org/zh_CN/latest/user/quickstart.html#id2
1.2 Open a website
res = requests.get(url)
1.3 Send other requests
1.3.1 Use requests.post(url) to send a POST request
1.3.2 Other HTTP methods are sent the same way
For example: requests.put(), requests.delete(), requests.head(), requests.options()
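As a minimal sketch of the POST form above (the httpbin.org URL and the `key`/`value` pair are placeholders, not from the original text), `requests.Request(...).prepare()` lets you inspect what would be sent without making a network call:

```python
import requests

# Build (but do not send) a POST request with form data.
# The URL and payload are placeholders for illustration only.
req = requests.Request('POST', 'https://httpbin.org/post', data={'key': 'value'})
prepared = req.prepare()
print(prepared.method)  # POST
print(prepared.body)    # key=value
```

In real use you would simply call requests.post(url, data={'key': 'value'}), which builds and sends the same request in one step.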
1.3.3 Carrying cookies: get cookies from the browser and pass them via get(url, cookies=cookies)
To obtain cookies in Chrome, open the page you want, click the three dots in the upper right corner -> More tools -> Developer tools, open the Network tab, and click a request name; the cookies are in its Headers.
For example, the cookies for a Weibo login are stored under the request named favicon.ico.
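A hedged offline sketch of passing cookies (the cookie name `SINAGLOBAL` comes from the Weibo example above, but its value here is made up; real Weibo cookies are much longer):

```python
import requests

# Hypothetical cookie copied from the browser's Network panel.
cookies = {'SINAGLOBAL': 'example-value'}

# Prepare (but do not send) the request to see the resulting Cookie header.
prepared = requests.Request('GET', 'https://weibo.com/', cookies=cookies).prepare()
print(prepared.headers['Cookie'])  # SINAGLOBAL=example-value
```

In practice, requests.get(url, cookies=cookies) attaches the same header and sends the request.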
1.4 Check whether the page was fetched correctly, and its encoding
res.encoding     # the page encoding; if you assign a new value to this attribute, res.text will use the new encoding
res.status_code  # the returned HTTP status code
1.4.1 requests.codes.ok means success
res.raise_for_status()  # throws an exception if the status code is not 200
Wrap it in try/except to get a clearer error message:
try:
    res.raise_for_status()
except Exception as exp:
    print(exp)
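A minimal offline sketch of the status check, using a hand-built response object instead of a real fetch (in real use `res` would come from `requests.get(url)`):

```python
import requests

# Hand-built response standing in for res = requests.get(url).
res = requests.models.Response()
res.status_code = 404

print(res.status_code == requests.codes.ok)  # False

try:
    res.raise_for_status()
except requests.HTTPError as exp:
    print(exp)  # describes the 404 error
```

requests.codes.ok is just the integer 200, so the comparison and raise_for_status() agree with each other.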
1.5 Get the returned content
1.5.1 Get text
res.text
type(res.text)  # => <class 'str'>
The returned value is a str, so slicing and the `in` operator can be used directly.
1.5.2 Get binary content
You could also use res.text, but the official example uses res.content:
>>> from PIL import Image
>>> from io import BytesIO
>>> i = Image.open(BytesIO(r.content))
1.5.3 Response JSON
r.json()  # throws an exception when parsing fails
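As an offline sketch of r.json() (the JSON payload below is made up; in real use the response would come from requests.get on a JSON API):

```python
import requests

# Simulate a JSON response without a network call.
res = requests.models.Response()
res.status_code = 200
res._content = b'{"status": "ok", "count": 3}'

data = res.json()  # parses the body as JSON; raises on invalid JSON
print(data['status'])  # ok
print(data['count'])   # 3
```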
1.5.4 Raw response content
r = requests.get('url', stream=True)  # set stream=True to access the raw socket content via r.raw
1.6 Write to file
You can write the data to a file directly with `with open(...)`, but when the file is too large this holds the whole body in memory, so write it in parts:
for chunk in res.iter_content(chunk_size):
    file.write(chunk)
chunk_size is in bytes; the specified amount is written on each iteration.
1.7 Parse the content with BeautifulSoup
Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
Install the package with pip3 install bs4, then import bs4 to extract specific content:
be = bs4.BeautifulSoup(htmlfile, "html.parser")
1.8 Use select to find a specified object
1.8.1 Use select to get a specified tag; the syntax is similar to CSS selectors
select('div')                        returns all <div> tags
'#id'                                the element with that id
'.class'                             all elements with that class
'div .class'                         all elements with that class enclosed in a <div>
'a[href]'                            all <a href="????"> tags; the <a> may have other attributes
'a[href="https://www.baidu.com"]'    all <a href="https://www.baidu.com"> tags; the <a> may have other attributes
'div > a'                            <a> directly inside a <div>, with no other tags in between
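The selector forms above can be exercised on a small made-up document (the HTML snippet below is for illustration only):

```python
import bs4

# Tiny sample document covering each selector form.
html = """
<div id="menu" class="nav">
  <a href="https://www.baidu.com">Baidu</a>
  <a>plain link</a>
</div>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

print(len(soup.select('div')))                                 # 1
print(soup.select('#menu')[0].name)                            # div
print(len(soup.select('.nav')))                                # 1
print(soup.select('a[href]')[0].text)                          # Baidu
print(soup.select('a[href="https://www.baidu.com"]')[0].text)  # Baidu
print(len(soup.select('div > a')))                             # 2
```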
1.8.2 Get the tag name and content
select returns a list of Tag objects:
eles = be.select('.menu')
type(eles)                # => <class 'list'>
print(eles)               # all matches
print(eles[0])            # one matching tag, e.g. <a>Baidu</a>
print(eles[0].getText())  # the tag's text; for <a>Baidu</a> this outputs Baidu
print(eles[0].attrs)      # only the current tag's attributes, without the sub-tag content
2. Examples
2.1 Pass key-value pairs to search with Baidu
Take Baidu as an example: to use Baidu search you need to build a url like https://www.baidu.com/s?wd=%E5%85%B3%E9%94%AE%E5%AD%97. To get this result, use get(url, params=parameter_dict). For example:
search = input("Enter the content you want to search on Baidu")
baidu = 'https://www.baidu.com/s?'
search_params = {'wd': search}
try:
    baidu_re = requests.get(baidu, params=search_params)
except Exception as err:
    print(err)
finally:
    pass
# specify the encoding when writing, to handle pages that are not UTF-8 and stay compatible with Windows
with open('baidusearch.html', 'w', encoding=baidu_re.encoding) as html:
    html.write(baidu_re.text)
2.2 Send cookies to Weibo
Many functions of Weibo can only be used after logging in. Therefore, to fetch information you need the cookies of a logged-in Weibo session: log in to Weibo and copy the cookies.
import requests
cookies = {'cookies': 'value'}  # note: copy the whole Cookie header, SINAGLOBAL...
requests.get('other Weibo pages', cookies=cookies)
# start parsing the desired content
2.3 Download the novel "Tao medical world"
"""Get all chapters"""
import requests, bs4, chardet
from io import StringIO

i = 1
strio = StringIO()
next_href = 'http://www.ppxs.net/63/63820/19862177.html'
text = ""

# obtain the title and body and store them in the StringIO stream
def getUrlText(url):
    global i, next_href
    print("start reading chapter from this page ({})".format(url))
    page = requests.get(url)
    # set the character encoding to the page encoding, otherwise the text is garbled
    page_text = page.text.encode(page.encoding)
    be = bs4.BeautifulSoup(page_text, "html.parser")
    i += 1
    next_page = be.select(".bottem a")
    next_href = next_page[3].attrs['href']
    print("next chapter %s" % next_href)
    title = be.select(".bookname h1")
    title_text = title[0].text
    txt = be.select("#booktext")
    per_txt = txt[0].text
    # each chapter starts with site boilerplate; delete it
    in_text = per_txt.lstrip('''<div class="content" id="booktext"><!--go--><p><font color="#FF0000" face="" size="3">welcome to Renren novels. Remember the address: http://www.ppxs.net, read on mobile at m.ppxs.net, so that you can read the latest chapters of the novel "Tao medical world" at any time...</font></p>''').replace("<br>", "")
    inner_text = title_text + "\n" + in_text
    strio.write(inner_text)

# getUrlText('http://www.ppxs.net/63/63820/19862177.html')
try:
    while next_href != True:
        getUrlText(next_href)
        # next_href = getUrl(next_href)
        print(i)
except Exception as e:
    print(e)
finally:
    with open("Tao medical world.txt", 'a+') as a:
        a.write(strio.getvalue())
Author: vz li Branch
Created: Tue
Emacs 25.1.1 (Org mode 8.2.10)