Using a Python crawler to fetch the Chinese version of the Python tutorial and save it as a Word file
I came across the Chinese version of the Python tutorial and found that it is only available as web pages. Since I have been learning to write crawlers recently, I thought it would be a good exercise to capture it and keep a local copy.
First, get the webpage content
After inspecting the source code of the webpage, you can use BeautifulSoup to extract the document title and content and save them as a doc file.
To do this, import the module with from bs4 import BeautifulSoup.
The code is as follows:
# Output the content of the website
import requests
from bs4 import BeautifulSoup

def introduce(url):
    res = requests.get(url)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    title = soup.select('h1')[0].text
    content = '\n'.join([p.text.strip() for p in soup.select('.section')])
    # print(title)
    # print(content)
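The line res.encoding = 'utf-8' in introduce() matters because the page is in Chinese: if the response bytes are decoded with the wrong codec, the text comes out garbled. A minimal sketch of that effect, using only the standard library (the sample string is an assumption for illustration and is not taken from the site):

```python
# Sample Chinese text (hypothetical; it stands in for the page body).
text = 'Python 入门指南'

# What travels over the network: the text encoded as UTF-8 bytes.
raw = text.encode('utf-8')

# Decoding with the wrong codec does not raise an error, it silently
# produces unreadable characters.
garbled = raw.decode('latin-1')

# Decoding with the matching codec recovers the original text.
restored = raw.decode('utf-8')

print(garbled != text)
print(restored == text)
```

Both print statements report True: the wrongly decoded string differs from the original, while the UTF-8 decoded one matches it.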
The next step is to use a for loop to traverse all the matched elements and get the links that the table of contents points to. The links obtained are relative, so prepend the main site address to each one to produce a valid URL and store it in the list address. After comparing the options, I used xpath to capture the directory addresses, which requires importing the module with from lxml import etree.
# Return the addresses corresponding to the directory
from lxml import etree

def get_url(selector):
    sites = selector.xpath('//div[@class="toctree-wrapper compound"]/ul/li')
    address = []
    for site in sites:
        directory = ''.join(site.xpath('a/text()'))  # link text (not used further)
        new_url = site.xpath('a/@href')  # relative link of the chapter
        address.append('http://www.pythondoc.com/pythontutorial3/' + ''.join(new_url))
    return address
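Concatenating the site prefix by hand works here, but the standard library's urllib.parse.urljoin can also build the absolute address from the index page URL and each relative href. A sketch, assuming the hrefs look like the hypothetical file names below (the real values come from the a/@href query in get_url); to_absolute is an illustrative helper name, not part of the original script:

```python
from urllib.parse import urljoin

# Index page of the tutorial, as used in this article.
BASE_URL = 'http://www.pythondoc.com/pythontutorial3/index.html'

# Hypothetical relative links, standing in for the values of a/@href.
relative_links = ['appetite.html', 'interpreter.html#using']


def to_absolute(base, hrefs):
    """Resolve each relative href against the URL of the page it came from."""
    return [urljoin(base, href) for href in hrefs]


addresses = to_absolute(BASE_URL, relative_links)
for address in addresses:
    print(address)
```

urljoin replaces the last path segment (index.html) with the relative link, so the result does not depend on remembering whether the prefix ends with a slash.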
Then, in the main function, call get_url() to obtain all the URLs, and call introduce() on each of them to output all the text content.
def main():
    url = 'http://www.pythondoc.com/pythontutorial3/index.html#'
    html = requests.get(url)
    html.encoding = 'utf-8'
    selector = etree.HTML(html.text)
    introduce(url)
    url_list = get_url(selector)
    for url in url_list:
        introduce(url)

if __name__ == '__main__':
    main()
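In the loop above, an exception raised while handling one page would stop the whole run. A sketch of a more tolerant loop, assuming a hypothetical helper crawl_all that receives the page handler as a parameter (in this article that handler would be introduce); the stand-in handler and file names below exist only for demonstration:

```python
import time


def crawl_all(urls, handler, delay=0.0):
    """Call handler(url) for every url and collect failures instead of stopping."""
    failed = []
    for url in urls:
        try:
            handler(url)
        except Exception as exc:  # broad on purpose: any failure skips that page
            failed.append((url, str(exc)))
        time.sleep(delay)  # optional pause between requests
    return failed


# Stand-in handler for demonstration; it fails for one of the pages.
def fake_handler(url):
    if 'bad' in url:
        raise ValueError('could not fetch ' + url)


failures = crawl_all(['a.html', 'bad.html', 'c.html'], fake_handler)
print(failures)
```

The returned list holds the pages that could not be processed, so they can be retried later without crawling everything again.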
The last step is to write the output data to a .doc file. Import the os module at the top of the script and place the file-writing statement inside the introduce() function.
import os  # place this at the top of the script

    # inside introduce(), after content has been built
    with open('python.doc', 'a+', encoding='utf-8') as f:
        f.write(content)
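A self-contained sketch of the append step, assuming a hypothetical helper save_page that receives the title and content extracted in introduce(). Note that a file written this way is plain UTF-8 text with a .doc extension, not a document in Word's own binary format. The titles and bodies below are placeholder values:

```python
import os
import tempfile


def save_page(path, title, content):
    """Append one page (title, then body) to the output file as UTF-8 text."""
    with open(path, 'a+', encoding='utf-8') as f:
        f.write(title + '\n')
        f.write(content + '\n\n')


# Demonstration in a temporary directory so that nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, 'python.doc')
    save_page(out_path, 'Title 1', 'Body 1')
    save_page(out_path, 'Title 2', 'Body 2')
    with open(out_path, encoding='utf-8') as f:
        saved = f.read()

print(saved)
```

Because the file is opened in 'a+' mode, each call adds a page to the end of the file. Running the crawler twice against the same file therefore duplicates the content unless the file is deleted first.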
Now we have obtained the Chinese version of the Python tutorial and successfully written it to a local file. This is handy for someone like me whose network connection drops frequently! You can also read it on your phone.
bs4 can be installed directly from the command line with the pip install bs4 command.
On Windows, installing lxml may produce many errors. We recommend downloading the lxml build that matches your Python version from the Python extension packages website for Windows, and then using pip install ************** to install the whl file locally.
Note:
************** represents the full name of the installation file.
Before installing, switch to the directory where the downloaded file is located on the command line. Otherwise, an error is reported.