" This article has been written for a month or two, the middle of a busy final exam and other things are not to care about it, just run a bit of code found that a coding error, after crawling through the first page, the following error occurred:
unicodeencodeerror: ' GBK ' codec can ' t encode character ' \u22ef ' in position 93:illegal, multibyte.
After looking up some information and referring to the relevant posts on Cnblogs, I added the following statements at the beginning of the code:
import io
import sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')  # change the default encoding of standard output
Because I run the script under cmd, I need to change the default encoding of standard output; for a detailed explanation, please refer to the relevant posts on Cnblogs.
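The error above can be reproduced in isolation. This is a minimal sketch (standard codecs only, no assumptions beyond them): U+22EF, the midline horizontal ellipsis '⋯' that appears on the scraped pages, has no mapping in GBK, while GB18030 covers all of Unicode.

```python
# U+22EF has no GBK mapping, so encoding it raises UnicodeEncodeError;
# GB18030 maps every Unicode code point, so it always succeeds.
s = '\u22ef'

try:
    s.encode('gbk')
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False

gb18030_bytes = s.encode('gb18030')  # succeeds under gb18030
print(gbk_ok, len(gb18030_bytes))
```

This is why re-wrapping sys.stdout with encoding='gb18030' makes the print calls stop failing under cmd.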
Updated: 2017/1/12
====================================================================================================================
Python crawlers have always been a direction I wanted to get started with; by studying Cui Qingcai's personal blog, I completed a number of small projects. Thanks to him and his blog, which I also recommend to anyone learning.
This article is a Python 3.x version of the crawler code that captures the top stories of Qiushibaike (the "embarrassing encyclopedia"), and I hope it helps people who studied the same blog post and want to do it with Python 3.x.
For the specific steps, please refer to the original post on scraping Qiushibaike's hot jokes; this article only presents the finished code. The regular expressions in this article work at the time of writing.
Some of the differences between Python 2.x and Python 3.x relevant to this project are:
1. The referenced modules are different:
urllib2 in Python 2.x is urllib.request in Python 3.x.
2. Different output syntax:
the print statement in Python 2.x is the print() function in Python 3.x.
3. Different input syntax:
raw_input() in Python 2.x is input() in Python 3.x.
4. Different exception-handling syntax:
try: ...
except urllib2.URLError, e: ...
in Python 2.x becomes
try: ...
except urllib.request.URLError as e: ...
in Python 3.x.
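The first and fourth differences can be seen together in a small Python 3 fetch helper. This is only an illustrative sketch: the function name fetch and the timeout value are my own, not from the original post.

```python
import urllib.request
import urllib.error

def fetch(url, user_agent='Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'):
    # Python 3: urllib.request replaces Python 2's urllib2
    req = urllib.request.Request(url, headers={'User-Agent': user_agent})
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            return resp.read().decode('utf-8')
    # Python 3 'as e' syntax, vs. 'except urllib2.URLError, e:' in Python 2
    except urllib.error.URLError as e:
        print('Fetch failed:', getattr(e, 'reason', e))
        return None
```

On failure (unreachable host, refused connection) the URLError is caught with the Python 3 `as e` syntax and the function returns None instead of crashing.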
# Qiushibaike joke scraper
# -*- coding:utf-8 -*-
import urllib.request
import re
# Modified part
import io
import sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')  # change the default encoding of standard output

# Qiushibaike crawler class
class QSBK:
    # Initialization method, defines variables
    def __init__(self):
        self.pageIndex = 1
        self.user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        # Initialize headers
        self.headers = {'User-Agent': self.user_agent}
        # Variable holding the jokes; each element holds the jokes of one page
        self.stories = []
        # Variable indicating whether the program should keep running
        self.enable = False

    # Pass in the index of a page and get that page's source
    def getPage(self, pageIndex):
        try:
            url = 'http://www.qiushibaike.com/hot/page/' + str(pageIndex)
            # Build the request
            request = urllib.request.Request(url, headers=self.headers)
            # Use urlopen to get the page source
            response = urllib.request.urlopen(request)
            # Decode the page as UTF-8
            pageCode = response.read().decode('utf-8')
            return pageCode
        except urllib.request.URLError as e:
            if hasattr(e, "reason"):
                print(u"Failed to connect to Qiushibaike, reason:", e.reason)
            return None

    # Pass in a page index; return the list of jokes (without pictures) on that page
    def getPageItems(self, pageIndex):
        pageCode = self.getPage(pageIndex)
        if not pageCode:
            print("Page load failed...")
            return None
        pattern = re.compile('<div class="author clearfix">.*?title=.*?
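The article's extraction regex is cut off in the source, so here is a hedged sketch of how such a pattern is typically used with re.S (DOTALL). The sample HTML and the tail of the pattern are my own simplification, not Qiushibaike's real markup:

```python
import re

# Simplified stand-in for the page source; the real markup differs.
sample = '''
<div class="author clearfix"><a title="user1"></a></div>
<div class="content"><span>joke text one</span></div>
'''

# re.S makes '.' match newlines, so '.*?' can skip across tags;
# the two groups capture the author name and the joke text.
pattern = re.compile(r'<div class="author clearfix">.*?title="(.*?)".*?'
                     r'<div class="content"><span>(.*?)</span>', re.S)
items = pattern.findall(sample)
print(items)  # → [('user1', 'joke text one')]
```

Non-greedy `.*?` is essential here: a greedy `.*` would swallow everything up to the last match on the page instead of stopping at the nearest closing tag.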
Screenshot of the running result: