Automatically Scraping Jokes with Python 3

Source: Internet
Author: User

The school's server has outside-network access, so I decided to write a script that automatically scrapes jokes and posts them to the BBS. Searching the web, I found a joke site whose jokes are mostly not too lame. Its HTML structure looks like the following:

[Screenshot: HTML structure of the joke list page]


As you can see, the links to the jokes are listed inside <div class="List_title">, so a regular expression can pull out the addresses of the most recent jokes. Following one of those links leads to a joke page:

[Screenshot: HTML structure of a single joke page]
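The link extraction just described can be sketched like this. The HTML snippet below is a hypothetical stand-in modeled on the structure described in this post; the real page at http://www.jokeji.cn/list.htm may differ in its surrounding markup:

```python
import re

# Hypothetical fragment of the list page, modeled on the structure
# described above -- the real site's markup may differ.
html = '''<div class="List_title">
<a href="/jokehtml/a/2014051200030765.htm" target="_blank">Joke one</a>
<a href="/jokehtml/b/2014051200030766.htm" target="_blank">Joke two</a>
</div>'''

# Capture every href that opens in a new window, same idea as the
# crawler's rule_url pattern further down.
rule_url = re.compile(r'<a href="(.*?)" target="_blank">')
urls = rule_url.findall(html)
print(urls)
# ['/jokehtml/a/2014051200030765.htm', '/jokehtml/b/2014051200030766.htm']
```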


Each joke page is made up of several short jokes, all sitting under the <span id="text110"> tag, and each individual joke is wrapped in its own <P>, so it is easy to split them into a list. Since my goal is to post one joke per hour, scraping 20 per day is enough; each page averages about 5 jokes, so crawling 4 pages does the job.
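Splitting one page's <span id="text110"> block into individual jokes can be sketched like this. The fragment below is a hypothetical example following the tag layout described above:

```python
import re

# Hypothetical joke-page fragment following the layout described above;
# the real pages wrap each joke in uppercase <P> tags.
html2 = '<span id="text110"><P>1.First joke</P><BR><P>2.Second joke</P></span>'

# [\w\W] instead of '.' so the match can span newlines (explained below)
rule_joke = re.compile(r'<span id="text110">([\w\W]*?)</span>')
body = rule_joke.findall(html2)[0]

# Each joke sits in its own <P>, so splitting on <P> separates them
jokes = [p.replace('</P>', '').replace('<BR>', '').strip()
         for p in body.split('<P>') if p.strip()]
print(jokes)  # ['1.First joke', '2.Second joke']
```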

There are a couple of details worth noting. This joke site uses Chinese characters in its links, for example:

<a href="/jokehtml/Cold Jokes/2014051200030765.htm" target="_blank">Read ten thousand volumes, and jokes flow as if divinely inspired</a>

urllib.request.urlopen() cannot open a URL containing raw Chinese characters directly; the URL must first be percent-encoded with urllib.parse.quote() before it can be interpreted correctly. The other detail is that there is a newline between the individual jokes, and "." in a regular expression does not match a newline character, so it has to be replaced with "[\w\W]" for the match to work. Okay, here's the code:

import urllib.request
import urllib.parse
import re

# [\w\W] instead of '.' so the match can cross the newlines between jokes
rule_joke = re.compile(r'<span id="text110">([\w\W]*?)</span>')
rule_url = re.compile(r'<a href="(.*?)" target="_blank">')
mainurl = 'http://www.jokeji.cn'
url = 'http://www.jokeji.cn/list.htm'

req = urllib.request.urlopen(url)
html = req.read().decode('GBK')        # the site is GBK-encoded
urls = rule_url.findall(html)

f = open('joke.txt', 'w')
for i in range(4):                     # 4 pages * ~5 jokes each = ~20 jokes
    url2 = urllib.parse.quote(urls[i]) # percent-encode the Chinese path
    joke_url = mainurl + url2
    req2 = urllib.request.urlopen(joke_url)
    html2 = req2.read().decode('GBK')
    joke = rule_joke.findall(html2)
    jokes = joke[0].split('<P>')       # one joke per <P> block
    for j in jokes:
        j = j.replace('</P>', '\n')    # end each joke with a newline
        j = j.replace('<BR>', '')
        j = j[2:]                      # strip the leading joke number
        f.write(j)
f.close()
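The two details mentioned above can also be seen in isolation: percent-encoding a Chinese path for urlopen(), and why "." fails across newlines. The path below is a hypothetical example of the kind of link this site uses:

```python
import re
import urllib.parse

# Hypothetical Chinese-path link; urlopen() needs it percent-encoded
# first ('/' stays unescaped by default in quote()).
path = '/jokehtml/冷笑话/2014051200030765.htm'
print(urllib.parse.quote(path))
# /jokehtml/%E5%86%B7%E7%AC%91%E8%AF%9D/2014051200030765.htm

# '.' does not match a newline, so this search finds nothing:
text = 'a\nb'
print(re.search(r'a(.*?)b', text))                       # None
# '[\w\W]' matches any character, newline included:
print(repr(re.search(r'a([\w\W]*?)b', text).group(1)))   # '\n'
```

This is why rule_joke above uses [\w\W] rather than "." (an alternative would be compiling the pattern with re.DOTALL).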

Take a look at the results of the crawl:



This way, each joke ends up on its own line, which makes the file easy for other programs to consume.


Reprint notice: please credit the original at http://blog.csdn.net/littlethunder/article/details/25693641

