My first contact with crawlers was in May this year, when I wrote a blog search engine. The crawler it used was quite intelligent, at least far more advanced than the crawler the movie site uses!
Back to the topic of writing a crawler in Python.
Python has always been my main scripting language, bar none. The language is simple and flexible, and its standard library is powerful. It can be used as a calculator, for text encoding conversion, image processing, batch downloading, batch text processing, and so on. In short, I like it, and the more I use it the better I get at it. Such a useful tool, I wouldn't tell just anyone about it...
Thanks to its powerful string-handling capabilities and modules such as urllib2, cookielib, re, and threading, Python makes writing a crawler extremely easy. How easy? I was telling a classmate that the few crawlers I wrote for the movie site, plus the assorted scripts for organizing the data, total fewer than 1000 lines of code, and the movie site itself is only about 150 lines. Since the crawler code lives on another 64-bit Hackintosh, I won't list it; below is only the VPS website code, written with the tornadoweb framework:
[xiaoxia@307232 movie_site]$ wc -l *.py template/*
156 msite.py
template/base.html
template/category.html
template/id.html
template/index.html
template/search.html
Now let me show the crawler-writing process directly. The following content is shared for learning and exchange only, with no other intent.
Take the latest video download resources on a certain bay as an example. The URL is
http://a piratebay.se/browse/200
Since this page has a lot of ads, only the main body part is posted here:
For a Python crawler, one line of code is enough to download this page's source. The urllib2 library is used here.
>>> import urllib2
>>> html = urllib2.urlopen('http://a piratebay.se/browse/200').read()
>>> print 'size is', len(html)
size is 52977
Of course, you can also use the os module's system() function to call the wget command to download the page content, which is convenient for students who have already mastered the wget or curl tools.
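For example, a minimal sketch of the os-based approach (assuming wget is installed and on the PATH; the output filename is arbitrary):

import os

# Shell out to wget instead of using urllib2; -O names the output file.
# The URL here is the obfuscated placeholder used throughout this post.
os.system('wget "http://a piratebay.se/browse/200" -O page.html')
html = open('page.html').read()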
Using Firebug to observe the page structure, you can see that the main body of the HTML is a table, and each resource is a <tr> tag.
For each resource, the information that needs to be extracted is:
1. Video category
2. Resource name
3. Resource link
4. Resource size
5. Upload time
That's enough for a start; more fields can be added if needed.
First, take a look at the HTML code inside a <tr> tag.
<tr>
  <td class="vertTh">
    <center>
      <a href="/browse/200" title="More in this directory">Video</a><br/>
      (<a href="/browse/205" title="More in this directory">TV shows</a>)
    </center>
  </td>
  <td>
    <div class="detName">
      <a href="/torrent/7782194/The_Walking_Dead_Season_3_Episodes_1-3_HDTV-x264" class="detLink" title="Details for The Walking Dead Season 3 Episodes 1-3 HDTV-x264">The Walking Dead Season 3 Episodes 1-3 HDTV-x264</a>
    </div>
    <a href="magnet:?xt=urn:btih:4f63d58e51c1a4a997c6f099b2b529bdbba72741&dn=The+Walking+Dead+Season+3+Episodes+1-3+HDTV-x264&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80&tr=udp%3A%2F%2Ftracker.publicbt.com%3A80&tr=udp%3A%2F%2Ftracker.istole.it%3A6969&tr=udp%3A%2F%2Ftracker.ccc.de%3A80" title="Download this torrent using magnet"></a>
    <a href="//torrents.piratebay.se/7782194/The_Walking_Dead_Season_3_Episodes_1-3_HDTV-x264.7782194.TPB.torrent" title="Download torrent"></a>
    <font class="detDesc">Uploaded <b>3 minutes ago</b>, Size 2 GiB, Uploaded by <a class="detDesc" href="/user/paridha/" title="Browse paridha">paridha</a></font>
  </td>
  <td align="right">0</td>
  <td align="right">0</td>
</tr>
Next, a regular expression is used to extract the content from the HTML source. Students unfamiliar with regular expressions can visit http://docs.python.org/2/library/re.html to learn more.
There is a reason for using regular expressions instead of other tools that parse HTML or build a DOM tree: I previously tried using BeautifulSoup3 to extract the content and found it really slow; handling about 100 entries per second was the limit of my computer... Switching to a regular expression, compiled and then applied to the content, blew that speed away!
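To give a feel for the comparison, here is a rough timing sketch, not my original benchmark; the import path is the BeautifulSoup3-era package name, and html is the page source downloaded above:

import re
import time
from BeautifulSoup import BeautifulSoup  # BeautifulSoup3-era import path

def rows_with_soup(html):
    # Build the whole DOM tree, then walk it for the resource rows.
    soup = BeautifulSoup(html)
    return soup.findAll('tr')

row_re = re.compile(r'<tr>.+?</tr>', re.DOTALL)  # compiled once, reused

def rows_with_re(html):
    # A precompiled pattern scans the raw string with no tree building.
    return row_re.findall(html)

for parse in (rows_with_soup, rows_with_re):
    start = time.time()
    parse(html)
    print parse.__name__, time.time() - start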
With so much content to extract, how should the regular expression be written?
Based on my experience, the non-greedy patterns ".*?" and ".+?" are very good things. But there are also some small pitfalls to watch out for, which you'll discover in actual use, as the sketch below shows.
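A quick illustration of the difference, with a made-up sample string:

import re

s = '<b>first</b> and <b>second</b>'

# Greedy: .+ grabs as much as possible, swallowing the middle tags.
print re.findall(r'<b>(.+)</b>', s)   # ['first</b> and <b>second']

# Non-greedy: .+? stops at the earliest possible match.
print re.findall(r'<b>(.+?)</b>', s)  # ['first', 'second']

# One of the small pitfalls: . does not match newlines by default,
# so multi-line HTML needs the re.DOTALL flag.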
For the <tr> tag code above, my expression first needs to match the opening symbol, which is
<tr>
the start of the content. Of course, the anchor could be something else, as long as it doesn't miss anything needed. Then, to get the video category, I want to match the content below:
(<a href="/browse/205" title="More in this directory">TV shows</a>)
And then I'm going to match the resource link,
<a href="..." class="detLink" title="...">...</a>
and then the other resource information,
<font class="detDesc">Uploaded <b>3 minutes ago</b>, Size 2 GiB, Uploaded by
and finally match
</tr>
Done!
Of course, the final </tr> doesn't actually need to appear in the regular expression; as long as the starting position is correct, the positions from which the later information is grabbed will also be correct.
Friends who know more about regular expressions can probably see how to write it. Let me show the process I went through writing the expression:
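Assembled piece by piece, the expression can be tested against the <tr> block above. This is the same pattern as in the full script further down, just split up with comments; html is the page source downloaded earlier:

import re

find_re = re.compile(
    r'<tr>.+?\(.+?">(.+?)</a>'           # 1. video category
    r'.+?class="detLink".+?">(.+?)</a>'  # 2. resource name
    r'.+?<a href="(magnet:.+?)"'         # 3. magnet link
    r'.+?Uploaded <b>(.+?)</b>'          # 4. upload time
    r', Size (.+?),',                    # 5. resource size
    re.DOTALL)  # DOTALL lets . match newlines across the whole row

for x in find_re.findall(html):
    print x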
It's that simple, and the results come out. I feel pretty happy.
Of course, a crawler designed this way is targeted: it crawls the content of one particular site only, and it does no filtering on the links it collects. To crawl all of a site's page links, you can usually use BFS (breadth-first search); a minimal sketch follows.
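A minimal BFS sketch; the start URL and the same-site link pattern here are hypothetical stand-ins:

import re
import urllib2
from collections import deque

start_url = 'http://example.com/'  # hypothetical site to crawl
link_re = re.compile(r'href="(http://example\.com/[^"]*)"')

seen = set([start_url])
queue = deque([start_url])

# Breadth-first search: visit pages level by level, queueing every
# same-site link that has not been seen before.
while queue:
    page = queue.popleft()
    try:
        html = urllib2.urlopen(page).read()
    except Exception:
        continue  # skip pages that fail to download
    for link in link_re.findall(html):
        if link not in seen:
            seen.add(link)
            queue.append(link)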
The complete Python crawler code to crawl the latest 10 pages of video resources from a certain bay:
# coding: utf8
import urllib2
import re
import pymongo

# Connect to the local MongoDB (legacy pymongo Connection API)
db = pymongo.Connection().test

# The site URL is deliberately obfuscated in this post
url = 'http://a piratebay.se/browse/200/%d/3'
find_re = re.compile(r'<tr>.+?\(.+?">(.+?)</a>.+?class="detLink".+?">(.+?)</a>.+?<a href="(magnet:.+?)".+?Uploaded <b>(.+?)</b>, Size (.+?),', re.DOTALL)

# Crawl the latest 10 pages of video resources
for i in range(0, 10):
    u = url % (i)
    # Download the page
    html = urllib2.urlopen(u).read()
    # Find the resource information
    for x in find_re.findall(html):
        values = dict(
            category = x[0],
            name = x[1],
            magnet = x[2],
            time = x[3],
            size = x[4]
        )
        # Save to the database
        db.priate.save(values)

print 'done!'
The code above is only meant to show the idea. Actually running it requires a MongoDB installation, and you may not be able to reach a certain bay's site to get normal results.
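For what it's worth, reading the saved records back out of MongoDB is just as short; a sketch using the same collection as the script above:

import pymongo

db = pymongo.Connection().test  # same legacy pymongo API as the crawler

# Print what the crawler stored.
for doc in db.priate.find():
    print doc['name'], '->', doc['magnet']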
So, the crawler the movie site uses is not hard to write; what's hard is organizing the data after it's obtained so that useful information can be extracted. For example, how to match a movie's metadata with its resources, and how to establish links between the video library and the video links: all of this requires trying various methods and finally settling on one that's fairly reliable.
A classmate once emailed me wanting the source code of my crawler.
If I had actually sent it, he'd get a few hundred lines of code that fit on a single sheet of A4 paper, and wouldn't he then say: what a rip-off!!! ...
It is said that this is the era of information explosion, so whoever has the stronger data-mining skills wins!
Okay, so here comes the question: when it comes to excavator (data-mining) technology, which school is the strongest?