A Python crawler example: scraping a joke site with Python

Source: Internet
Author: User
Tags: nets

The school server can access the external network, so I decided to write a script that automatically crawls jokes and posts them to the BBS. Searching the web turned up a joke site whose jokes mostly didn't feel too lame. Its HTML structure is as follows:

You can see that the list of jokes sits inside the <div class="List_title"> element, so a regular expression can extract the addresses of the most recent jokes. Following one of those links leads to a joke page:

Each joke page contains several short jokes, all inside the <span id="text110"> tag, and each joke is wrapped in its own <p>, so it is easy to split them into a list. Since my goal is to post one joke per hour, crawling 20 is enough; each page has about 5 jokes on average, so 4 pages will do. There are a few details to handle. First, the joke page links contain Chinese characters, for example:
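The extraction step described above can be sketched as follows, using a hypothetical snippet shaped like the page structure described (the real pages are larger, but the idea is the same):

```python
import re

# Hypothetical snippet in the shape described above: jokes inside
# <span id="text110">, each wrapped in <P>...</P>
html = '<span id="text110"><P>1.First joke</P><P>2.Second joke</P></span>'

# Grab the span's contents; [\w\W] matches any character, including newlines
block = re.search('<span id="text110">([\w\W]*?)</span>', html).group(1)

# Split on the opening <P> and strip the closing </P> to get one joke per entry
jokes = [j.replace('</P>', '') for j in block.split('<P>') if j]
# jokes → ['1.First joke', '2.Second joke']
```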

<a href="/jokehtml/Cold joke/2014051200030765.htm" target="_blank">Reading broken million, funny as God</a>

urllib.request.urlopen cannot parse a URL containing Chinese characters directly; the URL must first be percent-encoded with urllib.parse.quote. The other detail is that there is a newline between the individual jokes, and "." in a regular expression does not match a newline character, so it has to be changed to "[\w\W]" (the union of word and non-word characters, which matches anything). Okay, here's the code:
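Both details can be checked in isolation. Below is a minimal sketch: the path with Chinese characters is a made-up example in the same shape as the site's links, and the two searches show why "." fails across a newline while "[\w\W]" works:

```python
import re
import urllib.parse

# Hypothetical link path containing Chinese characters, like the site's joke links
path = '/jokehtml/冷笑话/2014051200030765.htm'
quoted = urllib.parse.quote(path)  # percent-encodes the non-ASCII segment; '/' is kept
# quoted → '/jokehtml/%E5%86%B7%E7%AC%91%E8%AF%9D/2014051200030765.htm'

# '.' does not match a newline, so this pattern fails on multi-line content...
assert re.search('<p>(.*?)</p>', '<p>a\nb</p>') is None
# ...while [\w\W] matches any character, newlines included
assert re.search('<p>([\w\W]*?)</p>', '<p>a\nb</p>').group(1) == 'a\nb'
```

(Note that quote defaults to UTF-8; since this site serves GBK pages, the encoding of the links may need checking in practice.)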

import urllib.request
import urllib.parse
import re

# The block of jokes inside <span id="text110">; [\w\W] also matches newlines
rule_joke = re.compile('<span id="text110">([\w\W]*?)</span>')
# The joke-page links on the list page
rule_url = re.compile('<a href="(.*?)" target="_blank">')
mainurl = 'http://www.jokeji.cn'
url = 'http://www.jokeji.cn/list.htm'

req = urllib.request.urlopen(url)
html = req.read().decode('GBK')            # the site is GBK-encoded
urls = rule_url.findall(html)
f = open('joke.txt', 'w')

for i in range(4):                         # 4 pages * ~5 jokes each ≈ 20 jokes
    url2 = urllib.parse.quote(urls[i])     # percent-encode the Chinese part of the link
    joke_url = mainurl + url2
    req2 = urllib.request.urlopen(joke_url)
    html2 = req2.read().decode('GBK')
    joke = rule_joke.findall(html2)
    jokes = joke[0].split('<P>')           # one <P>-wrapped chunk per joke

    for j in jokes:
        j = j.replace('</P>', '')
        j = j.replace('<BR>', '')
        j = j[2:]                          # drop the leading numbering characters
        f.write(j)
f.close()

Take a look at the results of the crawl:

This way each joke sits on its own line, which makes the output easy for other programs to consume.

