The school server has Internet access, so I plan to write something that automatically crawls jokes and posts them to the BBS. I searched online and found a joke website whose jokes are mostly not too corny. Its HTML structure looks like this:
As you can see, the list of joke links sits inside <div class="list_title">. A regular expression can pull out the addresses of the most recent jokes, and then each joke page can be opened in turn:
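As a rough sketch of this first step (the list URL is the same one used in the full script further down; the exact pattern may need adjusting to the live HTML), the link extraction looks something like this:

import re
import urllib.request

# Fetch the list page (the site serves GBK) and pull out the href of every link.
# The full script below narrows this down to the joke links with a stricter pattern.
list_html = urllib.request.urlopen('http://www.jokeji.cn/list.htm').read().decode('gbk')
links = re.findall(r'<a href="(.*?)"', list_html)
print(links[:4])    # relative paths such as /jokehtml/....htm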
Each joke page contains several jokes, all of them inside the <span id="text110"> tag, with each joke wrapped in its own <P> tag, so it is easy to collect the jokes into a list (a sketch of this parsing step follows the two details below). Since the purpose of crawling is to post one joke every hour during the day, crawling about 20 jokes is enough; each page has roughly 5 jokes on average, so crawling 4 pages will do. There are a couple of details to handle. Some links on this joke site contain Chinese characters, for example:
<a href="/jokehtml/joke/2014051200030765.htm" target="_blank">read books and learn how to be funny</a>
The urllib.request.urlopen function cannot handle URLs containing Chinese characters, so the path has to be percent-encoded with urllib.parse.quote before opening it. The other detail is that there are line breaks between the jokes, and the regular expression "." does not match a newline, so it has to be changed to "[\w\W]".
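Here is a small sketch of these two details and of the <P> splitting mentioned above; the Chinese path is made up purely for illustration:

import re
import urllib.parse

# quote() percent-encodes the non-ASCII part of a path so urlopen() accepts it.
# (quote() uses UTF-8 by default; it also takes an encoding argument if needed.)
path = '/jokehtml/冷笑话/2014051200030765.htm'     # hypothetical path, for illustration only
print(urllib.parse.quote(path))
# -> /jokehtml/%E5%86%B7%E7%AC%91%E8%AF%9D/2014051200030765.htm

# '.' does not match a newline, so a pattern spanning lines comes back empty;
# '[\w\W]' matches any character, including '\n'.
page = '<span id="text110"><P>joke one</P>\n<P>joke two</P></span>'
print(re.findall('<span id="text110">(.*?)</span>', page))        # []
body = re.findall(r'<span id="text110">([\w\W]*?)</span>', page)[0]

# Each joke sits in its own <P> block, so splitting on '<P>' gives one joke per item.
print(body.split('<P>'))    # ['', 'joke one</P>\n', 'joke two</P>']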
With those details handled, the code is as follows:

import urllib.request
import urllib.parse
import re

# Regex for the joke body inside <span id="text110">, and for the links on the list page.
rule_joke = re.compile(r'<span id="text110">([\w\W]*?)</span>')
rule_url = re.compile(r'<a href="(.*?)"target="_blank" >')
mainUrl = 'http://www.jokeji.cn'
url = 'http://www.jokeji.cn/list.htm'

# Fetch the list page (the site is GBK-encoded) and collect the joke links.
req = urllib.request.urlopen(url)
html = req.read().decode('gbk')
urls = rule_url.findall(html)

f = open('joke.txt', 'w')
for i in range(4):                       # 4 pages, roughly 5 jokes each
    url2 = urllib.parse.quote(urls[i])   # percent-encode the Chinese part of the path
    joke_url = mainUrl + url2
    req2 = urllib.request.urlopen(joke_url)
    html2 = req2.read().decode('gbk')
    joke = rule_joke.findall(html2)
    jokes = joke[0].split('<P>')         # one joke per <P> block
    for i in jokes:
        i = i.replace('</P>', '')
        i = i.replace('<BR>', '')
        i = i[2:]                        # drop the leading characters left over from the split
        f.write(i)
f.close()
The crawled result looks like this:
In this way, each line is a separate joke, ready for other programs to use.
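For example, a downstream script (such as the BBS poster) might read the file back like this; the open() call relies on the same platform default encoding the crawler used when writing:

# Hypothetical consumer: load one joke per line, skipping blank lines.
with open('joke.txt') as f:
    jokes = [line.strip() for line in f if line.strip()]
print(len(jokes), 'jokes loaded')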
If you reprint this post, please credit the source.