My school's server can reach the outside network, so I decided to write something that automatically crawls jokes and posts them to the BBS. A web search turned up a joke site, and most of its jokes are not too lame. Its HTML structure looks like the following:
[Screenshot: HTML source of the joke-list page]
As you can see, the links to the jokes sit inside <div class="List_title">, so a regular expression can pull out the addresses of the most recent jokes. Opening one of those joke pages:
[Screenshot: HTML source of a joke page]
Every joke page consists of several short jokes, all under the <span id="text110"> tag, and each joke is wrapped in its own <p>, so it is easy to put each individual joke into a list (see the sketch below). Since my purpose in crawling jokes is to post one joke every hour, crawling 20 is enough; each page averages about 5 jokes, so crawling 4 pages will do.
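As a quick sketch of those two parsing steps (the sample HTML below is invented for illustration; the real pages use the tags and attributes described above):

import re

# Invented list-page excerpt; on the real site the links sit inside <div class="List_title">.
list_html = '<div class="List_title"><a href="/jokehtml/a.htm" target="_blank">title</a></div>'
rule_url = re.compile(r'<a href="(.*?)" target="_blank">')
print(rule_url.findall(list_html))   # ['/jokehtml/a.htm']

# Invented joke-page excerpt; on the real site the jokes sit inside <span id="text110">.
joke_html = '<span id="text110"><P>first joke</P><BR><P>second joke</P></span>'
rule_joke = re.compile(r'<span id="text110">([\w\W]*?)</span>')
body = rule_joke.findall(joke_html)[0]
print(body.split('<P>'))             # ['', 'first joke</P><BR>', 'second joke</P>']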
There are a couple of details to handle. This joke site's links contain Chinese characters, for example:
<a href= "/jokehtml/Cold Jokes/2014051200030765.htm" target= "_blank" > Reading broken million Juan, funny as God </a>
The urllib.request.urlopen function cannot handle such a Chinese URL directly; the path must first be percent-encoded with urllib.parse.quote before it can be interpreted correctly. Another detail is that there is a newline between the individual jokes, and "." in a regular expression cannot match a newline character, so it needs to be changed to "[\w\W]" to match.
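A quick demonstration of both details (the Chinese path mirrors the example link above):

import urllib.parse
import re

# quote() percent-encodes the non-ASCII characters but leaves '/' alone.
print(urllib.parse.quote('/jokehtml/冷笑话/2014051200030765.htm'))
# -> /jokehtml/%E5%86%B7%E7%AC%91%E8%AF%9D/2014051200030765.htm

# '.' stops at a newline, while the class [\w\W] matches any character at all.
text = 'a\nb'
print(re.findall(r'a.b', text))       # []
print(re.findall(r'a[\w\W]b', text))  # ['a\nb']

Okay, here's the full script: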
import urllib.request
import urllib.parse
import re

# Patterns for the joke body on a joke page and the links on the list page.
rule_joke = re.compile(r'<span id="text110">([\w\W]*?)</span>')
rule_url = re.compile(r'<a href="(.*?)" target="_blank">')

mainurl = 'http://www.jokeji.cn'
url = 'http://www.jokeji.cn/list.htm'

req = urllib.request.urlopen(url)
html = req.read().decode('GBK')         # the site serves GBK-encoded pages
urls = rule_url.findall(html)

f = open('joke.txt', 'w')
for i in range(4):                      # 4 pages, roughly 5 jokes each
    url2 = urllib.parse.quote(urls[i])  # percent-encode the Chinese part of the path
    joke_url = mainurl + url2
    req2 = urllib.request.urlopen(joke_url)
    html2 = req2.read().decode('GBK')
    joke = rule_joke.findall(html2)
    jokes = joke[0].split('<P>')        # one <P> per joke
    for j in jokes:
        j = j.replace('</P>', '')
        j = j.replace('<BR>', '')
        j = j[2:]                       # drop the two leading leftover characters
        f.write(j)
f.close()
Take a look at the crawl results: each line in joke.txt is a separate joke, which makes it easy for other programs to use.
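Since the whole point is to post a joke to the BBS every hour, the consuming side can be as simple as the following sketch (post_to_bbs is a hypothetical placeholder for whatever posting interface the school BBS actually exposes):

import time

def post_to_bbs(joke):
    # Hypothetical stand-in for the real BBS posting call.
    print('posting:', joke)

with open('joke.txt') as f:
    for line in f:
        joke = line.strip()
        if joke:
            post_to_bbs(joke)
            time.sleep(3600)   # one joke per hour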
When reposting, please credit: http://blog.csdn.net/littlethunder/article/details/25693641