Python3 automatic crawling joke

Source: Internet
Author: User

The school server can be connected to the Internet, so I plan to write something that automatically crawls jokes and sends them to bbs. I searched a joke website on the internet and found that most of them are not too cold. The html structure is as follows:



As you can see, the list of joke links is in <div class = "list_title">. You can use a regular expression to find the most recent joke addresses and go to a joke page to view them:



Each joke page is composed of multiple jokes, all of which are under the <span id = "text110"> label, and each joke is wrapped separately <p>, in this way, you can easily put each joke in a list. Since the purpose of crawling jokes is to make a joke every day for one hour, it is enough to crawl 20 jokes. Each page has an average of 5 jokes, it's okay to crawl 4 pages. Here are a few details. Some links of this joke network are in Chinese, such:

<A href = "/jokehtml/joke/2014051200030765.htm" target =" _ blank "> read books and learn how to be funny </a>

The urllib. request. urlopen function cannot parse URLs in Chinese. urllib. parse must be transcoded before parsing. Another detail is that there is a line break between each joke. The regular expression "." cannot match the line break. You need to change it to "[\ w \ W]" to match. Now, the Code is as follows:

import urllib.requestimport urllib.parseimport rerule_joke=re.compile('<span id=\"text110\">([\w\W]*?)</span>')rule_url=re.compile('<a href=\"(.*?)\"target=\"_blank\" >')mainUrl='http://www.jokeji.cn'url='http://www.jokeji.cn/list.htm'req=urllib.request.urlopen(url)html=req.read().decode('gbk')urls=rule_url.findall(html)f=open('joke.txt','w')for i in range(4):url2=urllib.parse.quote(urls[i])joke_url=mainUrl+url2req2=urllib.request.urlopen(joke_url)html2=req2.read().decode('gbk')joke=rule_joke.findall(html2)jokes=joke[0].split('<P>')for i in jokes:i=i.replace('</P>','')i=i.replace('<BR>','')i=i[2:]f.write(i)f.close()

View the crawling result:



In this way, each line is a separate joke for other programs to use.


Reprinted Please note:

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.