This article demonstrates how to use python to crawl the romance of the Three Kingdoms through examples. it is a good example for friends who want to learn python. if you need it, let's take a look at it.
The crawler tutorial in this article is divided into four parts:
1. where to crawl where
2. what to crawl
3. how to crawl
4. how to save the information after crawling
1. where to crawl
Romance of Three Kingdoms
II. crawling
Three Kingdoms full text
3. how to crawl
Open F12 on the Chrome page to find the article content on the node.
You only need to find this node and write the content to an html file.
content = soup.find("p", {"class": "bookyuanjiao", "id": "con"})
4. how to save after crawling
It is mainly to get the content, splice it into an html file, and save it.
#! Usr/bin/env #-*-coding: UTF-8-*-import urllib2import osfrom bs4 import BeautifulSoup as BSimport localeimport sysfrom lxml import etreeimport rereload (sys) sys. setdefaultencoding ('gbk') sub_folder = OS. path. join (OS. getcwd (), "sanguoyanyi") if not OS. path. exists (sub_folder): OS. mkdir (sub_folder) path = sub_folder # customize html as head of the articlesinput = open(r'0.html ', 'r') head = input. read () domain =' http://www.shicimingju.com/book/sanguoyanyi.html 'T= domain.find(r'.html ') new_domain = '/'. join (domain. split ("/") [:-2]) first_chapter_url = domain [: t] + "/" + str (1) + '.html 'print first_chapter_url # Get url if chapter listsreq = urllib2.Request (url = domain) resp = urllib2.urlopen (req) html = resp. read () soup = BS (html, 'lxml') chapter_list = soup. find ("p", {"class": "bookyuanjiao", "id": "mulu"}) sel = etree. HTML (str (chapter_list) result = sel. xpath ('// li/a/@ href') for each_link in result: each_chapter_link = new_domain + "/" + each_link print each_chapter_link req = urllib2.Request (url = each_chapter_link) resp = urllib2.urlopen (req) html = resp. read () soup = BS (html, 'lxml') content = soup. find ("p", {"class": "bookyuanjiao", "id": "con"}) title = soup. title. text title = title. split (u' _ romance of the Three Kingdoms _ poetry name sentence Net ') [0] html = str (content) html = head + html +""Filename = path +" \ "+ title + ". html "print filename # write file output = open (filename, 'w') output. write (html) output. close ()
The content of 0.html is as follows:
Summary
The above is how to use Python to crawl the romance of the Three Kingdoms. I hope it will be helpful for you to learn python. if you have any questions, please leave a message.
For more articles about how Python crawls the romance of the Three Kingdoms, refer to the Chinese PHP website!