Python crawler: crawl the Chinese version of the Python tutorial and save it as Word


I came across the Chinese version of the Python tutorial and found that it is only available as a web page. Since I have been learning about crawlers recently, I thought it would be handy to capture it locally.

First, get the webpage content

After viewing the page source, you can use BeautifulSoup to extract the document title and content and save them to a .doc file.

Here we import the module with from bs4 import BeautifulSoup; the requests library is also needed to download the pages.

The code is as follows:

# Output the content of the website
# Output the content of the website
import requests
from bs4 import BeautifulSoup

def introduce(url):
    res = requests.get(url)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    title = soup.select('h1')[0].text
    content = '\n'.join([p.text.strip() for p in soup.select('.section')])
    # print(title)
    # print(content)
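
For a quick sanity check, you can call the function on the index page with the two print statements uncommented (a minimal test, assuming the page keeps the h1 title and .section structure described above):

introduce('http://www.pythondoc.com/pythontutorial3/index.html')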

The next step is to use a for loop to traverse all the matched entries and extract the links that the table of contents points to. The extracted links are relative, so the main site's URL is prepended to each one to form a valid URL, which is stored in the list address. After some comparison, I chose XPath to capture the directory links, so from lxml import etree is used to import that module.

# Return the addresses corresponding to the directory entries
def get_url(selector):
    sites = selector.xpath('//div[@class="toctree-wrapper compound"]/ul/li')
    address = []
    for site in sites:
        directory = ''.join(site.xpath('a/text()'))  # chapter title (extracted but not used further)
        new_url = site.xpath('a/@href')
        address.append('http://www.pythondoc.com/pythontutorial3/' + ''.join(new_url))
    return address
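
To try it on its own, build the selector from the index page first (a sketch; get_url() is the function above):

import requests
from lxml import etree

html = requests.get('http://www.pythondoc.com/pythontutorial3/index.html')
html.encoding = 'utf-8'
selector = etree.HTML(html.text)
print(get_url(selector)[:5])  # show the first few generated chapter URLs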

Then, in the main function, call get_url() to obtain all the URLs, and call the introduce() function on each of them to output all the text content.

def main():
    url = 'http://www.pythondoc.com/pythontutorial3/index.html#'
    html = requests.get(url)
    html.encoding = 'utf-8'
    selector = etree.HTML(html.text)
    introduce(url)
    url_list = get_url(selector)
    for url in url_list:
        introduce(url)

if __name__ == '__main__':
    main()

The last step is to write the output to the .doc file. Import the os module at the top of the script and place the file-writing code inside the introduce() function.

import os  # place this at the top of the script (not actually used by the write below)

with open('python.doc', 'a+', encoding='utf-8') as f:
    f.write(content)
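
For clarity, here is what introduce() looks like once the write is folded in (a sketch assembled from the snippets above; writing the title line before the content is my own small addition so that chapters stay visually separated in the output file):

def introduce(url):
    res = requests.get(url)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    title = soup.select('h1')[0].text
    content = '\n'.join([p.text.strip() for p in soup.select('.section')])
    # append this page's text to the output file; the title line is an optional addition
    with open('python.doc', 'a+', encoding='utf-8') as f:
        f.write(title + '\n' + content + '\n')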

Now we have the Chinese version of the Python tutorial successfully written to a local file. This is great for someone like me whose network connection drops frequently, and you can also read it on your phone.

 

For bs4, you can install it directly from the command line with pip install bs4.

On Windows, installing lxml can run into many errors. We recommend downloading the matching version of lxml from the Windows Python extension packages site, and then installing the .whl file locally with pip install **************.

Note:

* ************** represents the full name of the downloaded installation file.

During installation, make sure the command line is in the directory containing the downloaded file; otherwise an error is reported.
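
After installing, a quick import check confirms that both parsers are available (a minimal sketch, assuming a standard Python setup):

from bs4 import BeautifulSoup  # raises ImportError if bs4/beautifulsoup4 is missing
from lxml import etree         # raises ImportError if the lxml wheel did not install
print('bs4 and lxml are both available')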

 
