The crawler tutorial in this article is divided into four parts:
1. Where to crawl from
2. What to crawl
3. How to crawl
4. How to save the content after crawling
One, where to crawl from
Romance of the Three Kingdoms: http://www.shicimingju.com/book/sanguoyanyi.html
Two, what to crawl
The full text of Romance of the Three Kingdoms
Three, how to crawl
Open a chapter page in Chrome and press F12 to open the developer tools. The text of the chapter sits in this node:
<div id="con" class="bookyuanjiao">
Find this node and write its content to an HTML file:
content = soup.find("div", {"class": "bookyuanjiao", "id": "con"})
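The lookup can be tried offline before pointing it at the live site. The sketch below assumes Python 3 with the bs4 package installed and uses the built-in "html.parser" backend. SAMPLE_PAGE and extract_content are hypothetical names made up for illustration, and the sample markup only imitates the node described above.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded chapter page (not the real site markup).
SAMPLE_PAGE = """
<html><head><title>Chapter 1_Book_Site</title></head>
<body>
<div id="nav">navigation</div>
<div id="con" class="bookyuanjiao"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
"""


def extract_content(page_html):
    """Return the chapter node, or None if the page has no such node."""
    soup = BeautifulSoup(page_html, "html.parser")
    return soup.find("div", {"class": "bookyuanjiao", "id": "con"})


content = extract_content(SAMPLE_PAGE)
paragraphs = [p.get_text() for p in content.find_all("p")]
print(paragraphs)
```

find returns None when no node matches, so a real crawler should check the result before converting it to a string.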
Four, how to save after crawling
The main step is to take the extracted content, splice it into a complete HTML page, and save that page to a file.
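The splice-and-save step can be sketched in isolation. This is a minimal sketch in Python 3 that uses only the standard library. HEAD is a hypothetical placeholder rather than the article's real 0.html, and save_chapter is an illustrative helper that does not appear in the original script.

```python
import os
import tempfile

# Hypothetical head; the article's real 0.html is not reproduced here.
HEAD = '<html><head><meta charset="utf-8"><title>Chapter</title></head><body>'


def save_chapter(folder, title, content_html, head=HEAD):
    """Wrap content_html in a full page and write it to <folder>/<title>.html."""
    os.makedirs(folder, exist_ok=True)
    page = head + content_html + "</body></html>"
    target = os.path.join(folder, title + ".html")
    with open(target, "w", encoding="utf-8") as f:
        f.write(page)
    return target


# Write one sample chapter into a temporary folder.
out_dir = tempfile.mkdtemp()
saved = save_chapter(out_dir, "chapter-1", '<div id="con"><p>Text.</p></div>')
print(saved)
```

Writing with an explicit encoding avoids depending on the platform default, which matters for Chinese text.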
The complete script (Python 2):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib2
import os
from bs4 import BeautifulSoup as BS
import locale
import sys
from lxml import etree
import re

reload(sys)
sys.setdefaultencoding('gbk')

# Create the output folder next to the script if it does not exist yet
sub_folder = os.path.join(os.getcwd(), "sanguoyanyi")
if not os.path.exists(sub_folder):
    os.mkdir(sub_folder)
path = sub_folder

# Custom HTML used as the head of every saved chapter
with open(r'0.html', 'r') as head_file:
    head = head_file.read()

domain = 'http://www.shicimingju.com/book/sanguoyanyi.html'
t = domain.find(r'.html')
new_domain = '/'.join(domain.split("/")[:-2])
first_chapter_url = domain[:t] + "/" + str(1) + '.html'
print first_chapter_url

# Fetch the table of contents and collect the chapter links
req = urllib2.Request(url=domain)
resp = urllib2.urlopen(req)
html = resp.read()
soup = BS(html, 'lxml')
chapter_list = soup.find("div", {"class": "bookyuanjiao", "id": "mulu"})
sel = etree.HTML(str(chapter_list))
result = sel.xpath('//li/a/@href')

for each_link in result:
    each_chapter_link = new_domain + "/" + each_link
    print each_chapter_link
    # Download the chapter page and pull out the content node
    req = urllib2.Request(url=each_chapter_link)
    resp = urllib2.urlopen(req)
    html = resp.read()
    soup = BS(html, 'lxml')
    content = soup.find("div", {"class": "bookyuanjiao", "id": "con"})
    # Keep only the chapter title, the part of the page title before the
    # first underscore (the site appends the book and site name after it)
    title = soup.title.text
    title = title.split(u'_')[0]
    html = str(content)
    html = head + html + "</body></html>"
    # Assumed completion: the listing is cut off after "</body>". Following
    # the description above, save the page under the chapter title.
    with open(os.path.join(path, title + '.html'), 'w') as chapter_file:
        chapter_file.write(html)
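The script builds each chapter URL by joining the site root and the link with a slash, which only works when the links have the expected form. A possible alternative, sketched here in Python 3, is to resolve each link with urljoin from the standard library. The href values below are hypothetical samples standing in for the XPath result, and site_root and chapter_urls are illustrative helper names.

```python
from urllib.parse import urljoin, urlsplit

BOOK_URL = "http://www.shicimingju.com/book/sanguoyanyi.html"


def site_root(url):
    """Return the scheme and host of url, with a trailing slash."""
    parts = urlsplit(url)
    return "{}://{}/".format(parts.scheme, parts.netloc)


def chapter_urls(base_url, hrefs):
    """Resolve each href from the table of contents against the site root."""
    root = site_root(base_url)
    return [urljoin(root, href) for href in hrefs]


# Hypothetical href values, standing in for the result of the XPath query.
sample_hrefs = ["/book/sanguoyanyi/1.html", "book/sanguoyanyi/2.html"]
for url in chapter_urls(BOOK_URL, sample_hrefs):
    print(url)
```

Resolving against the site root gives the same absolute URL whether or not a link starts with a slash, and it avoids the doubled slash that plain concatenation can produce.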
0.html holds the custom HTML head that the script prepends to every chapter.
Summary
That is how to crawl Romance of the Three Kingdoms with Python. I hope it helps those who are learning Python. If you have questions, leave a message and we can discuss them.