Python Crawler Mini-Project: Scraping Hot Jokes from Qiushibaike

Source: Internet
Author: User

" This article has been written for a month or two, the middle of a busy final exam and other things are not to care about it, just run a bit of code found that a coding error, after crawling through the first page, the following error occurred:

UnicodeEncodeError: 'gbk' codec can't encode character '\u22ef' in position 93: illegal multibyte sequence

After searching for information and consulting the relevant posts on Blog Park (Cnblogs), I added the following statements at the beginning of the code:

import io
import sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')  # change the default encoding of standard output

Because I was running under cmd, I needed to change the default encoding of standard output; for details, please refer to the relevant posts on Blog Park.
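The failure and the fix can be reproduced without a console by writing to an in-memory byte stream; this is a minimal sketch, not part of the original post:

```python
import io

# GB18030 is a superset of GBK that covers all of Unicode, so a stream
# wrapped with it can emit characters that crash under cmd's default codec.
buf = io.BytesIO()
out = io.TextIOWrapper(buf, encoding='gb18030')
out.write('\u22ef')  # U+22EF, the character from the error message above
out.flush()
encoded = buf.getvalue()
```

Trying the same `'\u22ef'.encode('gbk')` raises exactly the UnicodeEncodeError quoted above, which is why swapping the wrapper's encoding is enough.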

Update Time: 2017/1/12

====================================================================================================================

Python crawlers have always been a direction I wanted to get into. By studying Jingmi, Cui Qingcai's personal blog, I completed a number of small projects. My thanks to him and his blog, which I also recommend to anyone learning this.

This article gives a Python 3.x version of the crawler code that scrapes the hot jokes from Qiushibaike, in the hope of helping people who learned from the same blog post and want to do it with Python 3.x.

For the specific steps, please refer to the original post on scraping Qiushibaike's hot jokes; this article only presents the finished code. The regular expressions in this article currently work.


Some of the differences between Python 2.x and Python 3.x that come up in this project are:

1. Different module names

The urllib2 module in Python 2.x becomes urllib.request in Python 3.x.
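As a sketch of the renamed module, the request this crawler builds later can be constructed like this in Python 3; no network call is actually made here, and the User-Agent string is the one used in the code below:

```python
import urllib.request

# Python 3 replacement for Python 2's urllib2.Request
url = 'http://www.qiushibaike.com/hot/page/1'
headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
request = urllib.request.Request(url, headers=headers)
# response = urllib.request.urlopen(request)  # the actual fetch, omitted here
```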

2. Different output syntax

The print statement in Python 2.x becomes the print() function in Python 3.x.
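Because print is a function in Python 3, its output can be captured or redirected like any other call; the prompt text in this sketch is illustrative, not from the original post:

```python
import contextlib
import io

# Python 2: print 'Press Enter...'   /   Python 3: print('Press Enter...')
buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    print('Press Enter to read the next page of jokes')
captured = buf.getvalue()
```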

3. Different input syntax

raw_input() in Python 2.x becomes input() in Python 3.x.
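A small sketch of the renamed input call; stdin is replaced with an in-memory stream so it runs non-interactively, and the "Q" reply is only an illustrative user input:

```python
import io
import sys

# Python 2: choice = raw_input()   /   Python 3: choice = input()
sys.stdin = io.StringIO('Q\n')  # simulate a user typing "Q" then Enter
choice = input()
sys.stdin = sys.__stdin__       # restore the real stdin
```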

4. Different exception handling

Python 2.x:

    try: ...
    except urllib2.URLError, e: ...

Python 3.x:

    try: ...
    except urllib.request.URLError as e: ...
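The Python 3 form above can be sketched as a small helper; the unresolvable .invalid hostname is used only so that the except branch actually runs without contacting a real site:

```python
import urllib.request

# Python 2: except urllib2.URLError, e   /   Python 3: except ... as e
def fetch(url):
    try:
        return urllib.request.urlopen(url, timeout=2).read()
    except urllib.request.URLError as e:
        if hasattr(e, 'reason'):
            print('Connection failed, reason:', e.reason)
        return None

result = fetch('http://nonexistent.invalid/')
```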

# -*- coding: utf-8 -*-
# Qiushibaike joke crawler
import urllib.request
import re
# modified part
import io
import sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')  # change the default encoding of standard output

# Qiushibaike crawler class
class QSBK:
    # initialization method, defines the variables
    def __init__(self):
        self.pageIndex = 1
        self.user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        # initialize headers
        self.headers = {'User-Agent': self.user_agent}
        # variable storing the jokes; each element holds one page's jokes
        self.stories = []
        # variable storing whether the program should keep running
        self.enable = False

    # pass in a page index and get that page's source
    def getPage(self, pageIndex):
        try:
            url = 'http://www.qiushibaike.com/hot/page/' + str(pageIndex)
            # build the request
            request = urllib.request.Request(url, headers=self.headers)
            # use urlopen to fetch the page source
            response = urllib.request.urlopen(request)
            # decode the page as UTF-8
            pageCode = response.read().decode('utf-8')
            return pageCode
        except urllib.request.URLError as e:
            if hasattr(e, "reason"):
                print(u"Failed to connect to Qiushibaike, reason:", e.reason)
            return None

    # pass in a page's source; return that page's list of jokes without pictures
    def getPageItems(self, pageIndex):
        pageCode = self.getPage(pageIndex)
        if not pageCode:
            print("Page load failed...")
            return None
        pattern = re.compile('<div class="author clearfix">.*?title=.*?  # (the listing is cut off here in the source)
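The listing above breaks off in the middle of the regular expression, so as an illustration only, here is how such a non-greedy pattern extracts the author and joke text from a simplified snippet; both the HTML and the pattern are my assumptions, not the post's actual regex or Qiushibaike's real markup:

```python
import re

# Illustrative markup only; real Qiushibaike pages differ
sample_html = (
    '<div class="author clearfix"><h2>someuser</h2></div>'
    '<div class="content">A short joke goes here.</div>'
)

# Hypothetical pattern in the spirit of the truncated one above:
# non-greedy groups capture the author name and the joke body
pattern = re.compile(r'<h2>(.*?)</h2>.*?<div class="content">(.*?)</div>', re.S)
items = pattern.findall(sample_html)
```

re.S (DOTALL) matters here, as in the original: it lets .*? cross newlines between the author block and the content block.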




Screenshot of the result:
