Environment: Ubuntu, Python 2.7
Basic knowledge
This program touches on several topics. They are only listed here, not explained in detail; if you have questions, a quick search (e.g. on Baidu) will turn up plenty of answers.
1. The urllib2 module builds the HTTP request: it takes the URL to crawl and the headers that disguise the crawler as a browser. After that, the urlopen and read methods are straightforward.
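A minimal sketch of such a request, written for the Python 2.7 environment this article uses (with a fallback import so the same code also runs on Python 3, where urllib2 became urllib.request); the network call itself is left commented out:

```python
try:
    from urllib2 import Request, urlopen          # Python 2.7, as used in this article
except ImportError:
    from urllib.request import Request, urlopen   # Python 3 equivalent

url = "http://www.quanben.com/xiaoshuo/0/910/59302.html"
# Pretend to be an old IE browser so the site serves the normal page.
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
request = Request(url, headers=headers)

# urlopen(request) performs the HTTP GET; .read() returns the raw page bytes:
# page = urlopen(request).read()
```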
2. The chardet module detects the encoding of a web page. Data crawled from the web easily runs into garbled-text problems, so to determine whether a page is GBK-encoded or UTF-8-encoded we use chardet's detect function. If you do not have this module, please download and install it; from here on I will assume you have it.
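A sketch of that detection step as a small helper. Note that chardet is a third-party package (installable with pip), so this version falls back to assuming UTF-8 when it is missing; the 'ignore' error mode, which silently drops undecodable bytes, follows the article's own code:

```python
def to_utf8(raw_bytes):
    """Guess the encoding of raw_bytes and return the data re-encoded as UTF-8."""
    try:
        import chardet
        detected = chardet.detect(raw_bytes)['encoding'] or 'utf-8'
    except ImportError:
        detected = 'utf-8'   # chardet not installed: assume UTF-8 (assumption)
    if detected.lower() == 'utf-8':
        return raw_bytes
    # 'ignore' drops any bytes that fail to decode, as in the article's code.
    return raw_bytes.decode(detected, 'ignore').encode('utf-8')
```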
3. The decode function converts a byte string from a given encoding into unicode, while encode converts unicode back into a byte string in the specified encoding.
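For example, round-tripping the character '大' between encodings (in Python 2.7 the byte-string type is str; the same calls work on bytes in Python 3):

```python
raw = b'\xe5\xa4\xa7'           # UTF-8 bytes for the character '大'
text = raw.decode('utf-8')      # bytes -> unicode text
gbk = text.encode('gbk')        # unicode text -> GBK-encoded bytes
back = gbk.decode('gbk').encode('utf-8')   # round-trip via GBK back to UTF-8
```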
4. Regular expressions via the re module. The search function finds the first substring matching a pattern, and sub replaces the matched substrings (note the standard-library function is re.sub; there is no re.replace).
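A small sketch of both calls; the HTML fragment here is invented for illustration:

```python
import re

html = '<h1>Chapter One</h1><p>Some&nbsp;text<br />more text</p>'

# re.search finds the first match; group(1) is the capture group's content.
title = re.search('<h1>(.*?)</h1>', html).group(1)

# re.sub replaces every match of the pattern.
body = re.sub(r'<br\s*/?>', '\n', html)
```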
Approach:
The URL we chose, http://www.quanben.com/xiaoshuo/0/910/59302.html, is the first chapter of the novel. If you view the page's source, you will find that a single content tag holds the entire text of the chapter, so a regular expression matching that tag is enough to pull the chapter out. If you try printing that content, you will see it is full of <br /> tags, which stand for line breaks, and &nbsp; entities, which are the page's placeholder for a space; replace the former with newlines and the latter with spaces and the chapter text comes out cleanly. For completeness, the title is grabbed with a regular expression as well.

```python
# -*- coding: utf-8 -*-
import urllib2
import re
import chardet

class Book_Spider:

    def __init__(self):
        self.pages = []

    # Fetch one chapter
    def GetPage(self):
        myUrl = "http://www.quanben.com/xiaoshuo/0/910/59302.html"
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        request = urllib2.Request(myUrl, headers=headers)
        myResponse = urllib2.urlopen(request)
        myPage = myResponse.read()

        # Detect the page's character encoding first, then normalize to UTF-8.
        charset = chardet.detect(myPage)['encoding']
        if charset not in ('utf-8', 'UTF-8'):
            myPage = myPage.decode('gb2312', 'ignore').encode('utf-8')
        unicodePage = myPage.decode('utf-8')

        # Grab the title. (The published snippet is cut off at this point;
        # the tag pattern that surrounded the capture group was lost.)
        my_title = re.search('(.*?)', unicodePage)
```
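The replacement step described above can be sketched like this (the chapter fragment is invented for illustration):

```python
import re

# A fragment shaped like the crawled chapter text (invented for illustration).
content = 'Chapter One<br />&nbsp;&nbsp;The first paragraph.<br />&nbsp;&nbsp;The second.'

content = re.sub(r'<br\s*/?>', '\n', content)   # <br /> tags become line breaks
content = content.replace('&nbsp;', ' ')        # &nbsp; entities become spaces
```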