The decode function converts a string from one encoding to a Unicode string

Source: Internet
Author: User

Environment: Ubuntu, Python 2.7

Basic knowledge

This program draws on several pieces of background knowledge. They are listed here without much detail; if you have questions, a quick Baidu search will turn up plenty of material.

1. The Request class of the urllib2 module sets up the HTTP request, including the URL to crawl and the User-Agent header that disguises the crawler as a browser. After that, the urlopen and read methods are easy to understand.

2. The chardet module detects the encoding of a web page. Data crawled from web pages easily comes out garbled, so the detect function of chardet is used to determine whether the page is encoded as GBK or UTF-8. If you do not have this module, please download and install it; the author assumes you already have it.

3. The decode function converts a string from a given encoding to a Unicode string, while encode converts a Unicode string to a string in the specified encoding.

4. Regular expressions with the re module. The search function finds the first match for a regular expression, and replace substitutes the matched string.
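The four points above can be sketched together without touching the network. This is a minimal sketch, not the author's program: the helper names build_request, to_utf8 and extract_content are hypothetical, the `<div id="content">` markup is an assumption about what a content tag might look like, and the import fallback is only there so the same code also runs on Python 3. chardet is left out so that the sketch needs nothing beyond the standard library.

```python
# -*- coding: utf-8 -*-
import re

try:
    import urllib2 as urlrequest          # Python 2.7, as in the article
except ImportError:
    import urllib.request as urlrequest   # Python 3 equivalent

def build_request(url):
    """Build (but do not send) a request carrying a browser-like User-Agent."""
    headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
    return urlrequest.Request(url, headers=headers)

def to_utf8(raw_bytes, source_encoding):
    """decode: bytes -> Unicode; encode: Unicode -> bytes in the target encoding."""
    return raw_bytes.decode(source_encoding, 'ignore').encode('utf-8')

def extract_content(page):
    """Return the text inside the assumed content tag, or None if it is absent."""
    match = re.search(u'<div id="content">(.*?)</div>', page, re.S)
    return match.group(1) if match else None

# The request object is only constructed here; urlopen(request).read() would fetch it.
request = build_request('http://www.quanben.com/xiaoshuo/0/910/59302.html')

# Hypothetical sample data standing in for a GBK-encoded page fragment.
sample = u'\u7b2c\u4e00\u7ae0'
gbk_bytes = sample.encode('gbk')
utf8_bytes = to_utf8(gbk_bytes, 'gbk')

content = extract_content(u'<html><div id="content">chapter text</div></html>')
```

Passing 'ignore' to decode drops bytes that are invalid in the source encoding instead of raising an error, which trades a few lost characters for a crawler that does not stop on a malformed page.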

Analysis:

The URL we selected is http://www.quanben.com/xiaoshuo/0/910/59302.html, the first chapter of 2881064151 Continental. If you view the page source, you will find that a single content tag holds the entire text of the chapter, so a regular expression that matches the content tag is enough to extract it. If you print this part, you will find a lot of <br /> and &nbsp;. Each <br /> should be replaced with a line break; &nbsp; is the web page placeholder for a space, so replacing it with a space is enough. After that, the chapter text comes out cleanly. For completeness, the title is also extracted with a regular expression.

    # -*- coding: utf-8 -*-
    import urllib2
    import re
    import chardet

    class Book_Spider:
        def __init__(self):
            self.pages = []

        # Fetch one chapter
        def GetPage(self):
            myUrl = "http://www.quanben.com/xiaoshuo/0/910/59302.html"
            user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
            headers = {'User-Agent': user_agent}
            request = urllib2.Request(myUrl, headers=headers)
            myResponse = urllib2.urlopen(request)
            myPage = myResponse.read()
            # First detect the page's character encoding, then normalize to UTF-8
            charset = chardet.detect(myPage)
            charset = charset['encoding']
            if charset == 'utf-8' or charset == 'UTF-8':
                myPage = myPage
            else:
                myPage = myPage.decode('gb2312', 'ignore').encode('utf-8')
            unicodePage = myPage.decode("utf-8")
            # Extract the title
            my_title = re.search('(.*?)
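The cleanup step described in the analysis can be sketched on its own. This is a hedged illustration, not the rest of the author's program: the function name clean_chapter is hypothetical, and it assumes the line-break markup is written exactly as `<br />` and the space placeholder as `&nbsp;`.

```python
# -*- coding: utf-8 -*-

def clean_chapter(fragment):
    """Turn the raw content fragment into plain text.

    Assumes line breaks appear as '<br />' and spaces as '&nbsp;'.
    """
    text = fragment.replace(u'<br />', u'\n')   # markup line break -> real newline
    text = text.replace(u'&nbsp;', u' ')        # HTML space placeholder -> space
    return text

# Hypothetical fragment shaped like the content described above.
raw = u'&nbsp;&nbsp;Line one<br />&nbsp;&nbsp;Line two'
cleaned = clean_chapter(raw)
```

Plain str.replace is enough here because both patterns are fixed strings; a regular expression would only be needed if the page mixed several spellings of the break tag.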

