The school server has access to the internet, so I planned to write a script that automatically crawls jokes and posts them to the BBS. Searching the web, I found a joke website whose jokes mostly aren't too lame. Its HTML structure is as follows:
You can see that the list of jokes sits inside the <div class="List_title"> element, so a regular expression can pull the addresses of the most recent joke pages out of that block; a quick sketch of this extraction is shown below.
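As a rough illustration only (the snippet below is a made-up stand-in for the real list markup, shaped like what the regex in the full script expects), the URL extraction could look like this:

import re

# Hypothetical fragment standing in for the real list.htm markup;
# the actual page content of jokeji.cn is not reproduced here.
list_html = '''
<div class="List_title">
  <a href="/jokehtml/冷笑话/2014051200030765.htm" target="_blank">joke one</a>
  <a href="/jokehtml/冷笑话/2014051200030766.htm" target="_blank">joke two</a>
</div>
'''

# Same pattern as in the full script below: capture everything between
# href=" and " target="_blank">
rule_url = re.compile(r'<a href="(.*?)" target="_blank">')
print(rule_url.findall(list_html))
# ['/jokehtml/冷笑话/2014051200030765.htm', '/jokehtml/冷笑话/2014051200030766.htm']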
Clicking into one of those joke pages, you can see that every page is made up of several short jokes, all inside a <span id="text110"> tag, with each individual joke wrapped in its own <p> element, so it is easy to split the individual jokes into a list. Since my goal is to send one joke per hour, crawling 20 jokes is enough; each page has about 5 jokes on average, so crawling 4 pages will do. There are a few details to watch out for. The joke page links contain Chinese characters, for example:
<a href="/jokehtml/冷笑话/2014051200030765.htm" target="_blank">Read ten thousand volumes, and joke as if divinely inspired</a>
urllib.request.urlopen cannot open such a Chinese URL directly; the path must first be transcoded (percent-encoded) with urllib.parse before it can be parsed correctly. Another detail is that there are newlines between the individual jokes, and "." in a regular expression does not match newline characters, so "[\w\W]" has to be used instead. Okay, here's the code:
import urllib.request
import urllib.parse
import re

rule_joke = re.compile(r'<span id="text110">([\w\W]*?)</span>')
rule_url = re.compile(r'<a href="(.*?)" target="_blank">')
mainUrl = 'http://www.jokeji.cn'
url = 'http://www.jokeji.cn/list.htm'

# Fetch the list page and collect the links to the individual joke pages
req = urllib.request.urlopen(url)
html = req.read().decode('GBK')
urls = rule_url.findall(html)

f = open('joke.txt', 'w')
for i in range(4):
    # Percent-encode the Chinese part of the path so urlopen accepts it
    url2 = urllib.parse.quote(urls[i])
    joke_url = mainUrl + url2
    req2 = urllib.request.urlopen(joke_url)
    html2 = req2.read().decode('GBK')

    # Grab everything inside <span id="text110"> and split it into single jokes
    joke = rule_joke.findall(html2)
    jokes = joke[0].split('<P>')

    for i in jokes:
        i = i.replace('</P>', '')
        i = i.replace('<BR>', '')
        i = i[2:]          # drop the first two leftover characters of each joke
        f.write(i)
f.close()
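To make the transcoding detail concrete, here is what urllib.parse.quote does to the Chinese part of such a path. This is just a standalone check, independent of the crawler; the example path is the one shown above:

import urllib.parse

path = '/jokehtml/冷笑话/2014051200030765.htm'
print(urllib.parse.quote(path))
# /jokehtml/%E5%86%B7%E7%AC%91%E8%AF%9D/2014051200030765.htm
# quote() leaves '/' alone and uses UTF-8 percent-escapes by default,
# which is also what the script above relies on. If the server expected
# GBK-based escapes instead, quote(path, encoding='gbk') could be used.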
Take a look at the results of the crawl:
This way, each line in the output file is a single joke, which makes it easy for other programs to use.
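For instance, the script that sends the hourly joke only needs to read the file line by line. This is just a hypothetical sketch of the consuming side; the BBS-posting part is not shown here:

import random

# Open with the same default encoding used when joke.txt was written
with open('joke.txt') as f:
    jokes = [line.strip() for line in f if line.strip()]

print(random.choice(jokes))   # pick one joke to send this hour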