" This article has been written for a month or two, the middle of a busy final exam and other things are not to care about it, just run a bit of code found that a coding error, after crawling through the first page, the following error occurred:
unicodeencodeerror: ' GBK ' codec can ' t encode character ' \u22ef ' in position 93:illegal, multibyte.
After looking up some information and referring to the relevant posts on Cnblogs, I added the following statements at the beginning of the code:
import io
import sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')  # change the default encoding of standard output
Because I run the script under cmd, I need to change the default encoding of standard output; for a detailed explanation, please refer to the relevant posts on Cnblogs.
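The error above can be reproduced in isolation. This is a minimal sketch (standard codecs only, no assumptions beyond them): U+22EF, the midline horizontal ellipsis '⋯' that appears on the scraped pages, has no mapping in GBK, while GB18030 covers all of Unicode.

```python
# U+22EF has no GBK mapping, so encoding it raises UnicodeEncodeError;
# GB18030 maps every Unicode code point, so it always succeeds.
s = '\u22ef'

try:
    s.encode('gbk')
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False

gb18030_bytes = s.encode('gb18030')  # succeeds under gb18030
print(gbk_ok, len(gb18030_bytes))
```

This is why re-wrapping sys.stdout with encoding='gb18030' makes the print calls stop failing under cmd.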
Updated: 2017/1/12
====================================================================================================================
Python crawlers have always been a direction I wanted to get started with; by studying Cui Qingcai's personal blog, I completed a number of small projects. Thanks to him and his blog, which I also recommend to anyone learning.
This article is a Python 3.x version of the crawler code that captures the top stories of Qiushibaike (the "embarrassing encyclopedia"), and I hope it helps people who studied the same blog post and want to do it with Python 3.x.
For the specific steps, please refer to the original post on scraping Qiushibaike's hot jokes; this article only presents the finished code. The regular expressions in this article work at the time of writing.
Some of the differences between Python 2.x and Python 3.x relevant to this project are:
1. The referenced modules are different:
urllib2 in Python 2.x is urllib.request in Python 3.x.
2. Different output syntax:
the print statement in Python 2.x is the print() function in Python 3.x.
3. Different input syntax:
raw_input() in Python 2.x is input() in Python 3.x.
4. Different exception-handling syntax:
try: ...
except urllib2.URLError, e: ...
in Python 2.x becomes
try: ...
except urllib.request.URLError as e: ...
in Python 3.x.
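The first and fourth differences can be seen together in a small Python 3 fetch helper. This is only an illustrative sketch: the function name fetch and the timeout value are my own, not from the original post.

```python
import urllib.request
import urllib.error

def fetch(url, user_agent='Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'):
    # Python 3: urllib.request replaces Python 2's urllib2
    req = urllib.request.Request(url, headers={'User-Agent': user_agent})
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            return resp.read().decode('utf-8')
    # Python 3 'as e' syntax, vs. 'except urllib2.URLError, e:' in Python 2
    except urllib.error.URLError as e:
        print('Fetch failed:', getattr(e, 'reason', e))
        return None
```

On failure (unreachable host, refused connection) the URLError is caught with the Python 3 `as e` syntax and the function returns None instead of crashing.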
# Qiushibaike joke scraper
# -*- coding:utf-8 -*-
import urllib.request
import re
# Modified part
import io
import sys
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')  # change the default encoding of standard output

# Qiushibaike crawler class
class QSBK:
    # Initialization method, defines variables
    def __init__(self):
        self.pageIndex = 1
        self.user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        # Initialize headers
        self.headers = {'User-Agent': self.user_agent}
        # Variable holding the jokes; each element holds the jokes of one page
        self.stories = []
        # Variable indicating whether the program should keep running
        self.enable = False

    # Pass in the index of a page and get that page's source
    def getPage(self, pageIndex):
        try:
            url = 'http://www.qiushibaike.com/hot/page/' + str(pageIndex)
            # Build the request
            request = urllib.request.Request(url, headers=self.headers)
            # Use urlopen to get the page source
            response = urllib.request.urlopen(request)
            # Decode the page as UTF-8
            pageCode = response.read().decode('utf-8')
            return pageCode
        except urllib.request.URLError as e:
            if hasattr(e, "reason"):
                print(u"Failed to connect to Qiushibaike, reason:", e.reason)
            return None

    # Pass in a page index; return the list of jokes (without pictures) on that page
    def getPageItems(self, pageIndex):
        pageCode = self.getPage(pageIndex)
        if not pageCode:
            print("Page load failed...")
            return None
        pattern = re.compile('<div class="author clearfix">.*?title=.*?
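The article's extraction regex is cut off in the source, so here is a hedged sketch of how such a pattern is typically used with re.S (DOTALL). The sample HTML and the tail of the pattern are my own simplification, not Qiushibaike's real markup:

```python
import re

# Simplified stand-in for the page source; the real markup differs.
sample = '''
<div class="author clearfix"><a title="user1"></a></div>
<div class="content"><span>joke text one</span></div>
'''

# re.S makes '.' match newlines, so '.*?' can skip across tags;
# the two groups capture the author name and the joke text.
pattern = re.compile(r'<div class="author clearfix">.*?title="(.*?)".*?'
                     r'<div class="content"><span>(.*?)</span>', re.S)
items = pattern.findall(sample)
print(items)  # → [('user1', 'joke text one')]
```

Non-greedy `.*?` is essential here: a greedy `.*` would swallow everything up to the last match on the page instead of stopping at the nearest closing tag.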
Screenshot of the running result: