Writing a Simple Web Crawler in Python to Scrape Video Download Resources

Source: Internet
Author: User
Tags: text processing
Judging from the comments on the previous article, many readers are interested in the crawler's source code. This article is a fairly detailed record of how to write a simple crawler in Python to scrape video download resources, walking through almost every step. I hope it is helpful.

I first came into contact with crawlers in May of this year, when I wrote a blog search engine. The crawler it used was fairly intelligent, at least far more sophisticated than the one the movie site uses!

Back to the topic of writing crawlers in Python.

Python has always been my primary scripting language, bar none. Python is simple and flexible, and its standard library is powerful. It can serve as a calculator, do text encoding conversion, image processing, batch downloads, batch text processing, and so on. In short, I like it, and the more I use it the handier it gets. Such a good tool, I don't tell just anyone about it...

Because of Python's powerful string-processing capabilities, and modules such as urllib2, cookielib, re, and threading, it is easy to write a crawler in Python. How easy? I told a classmate that the several crawlers and data-organizing scripts I wrote for the movie site add up to fewer than 1000 lines of code in total, and the movie site itself is only about 150 lines of code. Because the crawler code sits on another 64-bit Hackintosh, I won't list it; below is just a line count of the site code on the VPS, written with the Tornado web framework:

[xiaoxia@307232 movie_site]$ wc -l *.py template/*
  156 msite.py
      template/base.html
      template/category.html
   94 template/id.html
      template/index.html
      template/search.html

Here is a direct walkthrough of how the crawler was written. The following content is for learning and exchange only, and has no other purpose.

Take the latest video listings of a certain bay as an example. Its URL is

http://a-piratebay.se/browse/200

Because there are a lot of ads on that page, only the main text content is of interest here.

For a Python crawler, downloading this page's source code takes just one line of code, using the urllib2 library.

>>> import urllib2
>>> html = urllib2.urlopen('http://a-piratebay.se/browse/200').read()
>>> print 'size is', len(html)
size is 52977

Of course, you can also use the os module's system() function to call the wget command to download the page content; this is very convenient for those who have already mastered the wget or curl tools.
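As a rough sketch of that alternative (my own illustration, assuming wget is installed and on the PATH; the output file name page.html is arbitrary):

import os

# Call wget through the shell to save the page to a local file.
# The domain is masked here, as everywhere else in this article.
os.system('wget -q "http://a-piratebay.se/browse/200" -O page.html')
html = open('page.html').read()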

Inspecting the page structure with Firebug shows that the main content is an HTML table, and each resource is one tr tag.

For each resource, the information that needs to be extracted is:

1. Video category
2. Resource Name
3. Resource link
4. Resource size
5. Upload Time

That much is enough; you can extract more if you need to.

First, take a look at the code inside a tr tag.

<tr>
  <td class="vertTh">
    <center>
      <a href="/browse/200" title="More in this directory">Video</a><br/>
      (<a href="/browse/205" title="More in this directory">TV</a>)
    </center>
  </td>
  <td>
    <p class="detName">
      <a href="/torrent/7782194/The_Walking_Dead_Season_3_Episodes_1-3_HDTV-x264" class="detLink" title="Details for The Walking Dead Season 3 Episodes 1-3 HDTV-x264">The Walking Dead Season 3 Episodes 1-3 HDTV-x264</a>
    </p>
    <a href="magnet:?xt=urn:btih:4f63d58e51c1a4a997c6f099b2b529bdbba72741&dn=The+Walking+Dead+Season+3+Episodes+1-3+HDTV-x264&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80&tr=udp%3A%2F%2Ftracker.publicbt.com%3A80&tr=udp%3A%2F%2Ftracker.istole.it%3A6969&tr=udp%3A%2F%2Ftracker.ccc.de%3A80" title="Download this torrent using magnet"></a>
    <a href="//torrents.a-piratebay.se/7782194/The_Walking_Dead_Season_3_Episodes_1-3_HDTV-x264.7782194.TPB.torrent" title="Download torrent"></a>
    <font class="detDesc">uploaded <b>3 minutes ago</b>, size 2 GiB, uploader <a class="detDesc" href="/user/paridha/" title="Browse paridha">paridha</a></font>
  </td>
  <td align="right">0</td>
  <td align="right">0</td>
</tr>

The following uses a regular expression to extract the content from the HTML code. Readers unfamiliar with regular expressions can learn more at http://docs.python.org/2/library/re.html.

Why use regular expressions rather than a tool that parses HTML or builds a DOM tree? There is a reason. I tried using BeautifulSoup3 to extract the content and found it painfully slow: handling a hundred entries per second was already the limit of my computer. Switching to regular expressions, compiled before processing the content, made it essentially instantaneous.
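If you want to check this kind of speed difference yourself, a rough timing sketch might look like the following (my own illustration, not the author's benchmark; it assumes BeautifulSoup 3 is installed and uses a simplified row pattern, not the full pattern built later):

import time
import re
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3

html = open('page.html').read()  # a listing page saved earlier

start = time.time()
rows = BeautifulSoup(html).findAll('tr')  # parse a DOM tree, then walk it
print 'BeautifulSoup:', time.time() - start, 'seconds,', len(rows), 'rows'

row_re = re.compile(r'<tr>.+?</tr>', re.DOTALL)  # compiled once, reused
start = time.time()
rows = row_re.findall(html)
print 'compiled regex:', time.time() - start, 'seconds,', len(rows), 'rows'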

With so much content to extract, how should the regular expression be written?

Based on my past experience, ".*?" or ".+?" (non-greedy matching) works very well here. There are a few small pitfalls to watch out for, though, which you will discover in actual use.
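To illustrate why the non-greedy forms matter, here is a small example of my own (not from the original article):

import re

s = '<b>one</b><b>two</b>'
print re.findall(r'<b>(.+)</b>', s)   # greedy: ['one</b><b>two']
print re.findall(r'<b>(.+?)</b>', s)  # non-greedy: ['one', 'two']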

For the tr tag code above, my expression first needs to match the opening symbol, namely

<tr>

which marks the beginning of the content. Of course, the anchor could be anything else, as long as it doesn't miss the content you need. Then I match the following to get the video category:

(<a href="/browse/205" title="More in this directory">TV</a>)

Then I match the resource link:

<a href="..." class="detLink" title="...">...</a>

and after that the rest of the resource information:

<font class="detDesc">uploaded <b>3 minutes ago</b>, size 2 GiB, uploader

and finally, match the closing

</tr>

Done!

Of course, the final </tr> does not actually need to appear in the regular expression; as long as the starting position is anchored correctly, the positions of the later fields will be correct as well.

Friends who know regular expressions well can probably already see how to write it. Here is the expression I ended up with:
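Assembled, the pattern (identical to the one used in the complete code below) is:

find_re = re.compile(r'<tr>.+?\(.+?">(.+?)</a>.+?class="detLink".+?">(.+?)</a>.+?<a href="(magnet:.+?)".+?uploaded <b>(.+?)</b>, size (.+?),', re.DOTALL)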

It was that simple. The results came out, and I felt very pleased.

Of course, this crawler is targeted: it crawls the content of one specific site in a directed way. It also does no filtering of the links it collects. To crawl all the page links of a site, you would normally use BFS (breadth-first search).
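For the curious, a minimal BFS crawl might look like the following sketch (my own illustration, not part of the original crawler; example.com and the link pattern are placeholders):

import urllib2
import re
from collections import deque

def bfs_crawl(start_url, max_pages=50):
    # Breadth-first traversal of same-site links, visiting each page once.
    link_re = re.compile(r'href="(http://example\.com/[^"]*)"')  # placeholder host
    seen = set([start_url])
    queue = deque([start_url])
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        try:
            html = urllib2.urlopen(url).read()
        except Exception:
            continue  # skip pages that fail to download
        for link in link_re.findall(html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen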

Here is the complete Python crawler code for grabbing the latest 10 pages of video resources from a certain bay:

# coding: utf8
import urllib2
import re
import pymongo

db = pymongo.Connection().test

# The domain is masked here, as in the rest of this article.
url = 'http://a-piratebay.se/browse/200/%d/3'

find_re = re.compile(r'<tr>.+?\(.+?">(.+?)</a>.+?class="detLink".+?">(.+?)</a>.+?<a href="(magnet:.+?)".+?uploaded <b>(.+?)</b>, size (.+?),', re.DOTALL)

# Crawl the latest 10 pages of video resources in a directed way
for i in range(10):
    u = url % (i)
    # Download the page
    html = urllib2.urlopen(u).read()
    # Find the resource information
    for x in find_re.findall(html):
        values = dict(
            category = x[0],
            name = x[1],
            magnet = x[2],
            time = x[3],
            size = x[4]
        )
        # Save to the database
        db.priate.save(values)

print 'Done!'

The above code is only meant to show the idea. Actually running it requires a MongoDB database, and you may not be able to reach the bay's site, so you won't necessarily get normal results.

So the crawlers used by the movie site are not hard to write; what is hard is organizing the data after you get it to extract useful information. For example, how do you match a piece of movie information with its resources, and how do you build associations among the movie library, the videos, and the links? All of this requires trying various methods and then choosing a reasonably reliable one.
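As one illustration of the matching problem (my own sketch, not the author's actual method; the normalization rule is a deliberately naive assumption):

import re

def normalize_title(raw):
    # Naive normalization: lowercase, strip release tags, collapse separators.
    raw = raw.lower()
    raw = re.sub(r'\b(hdtv|x264|720p|1080p|bluray)\b', '', raw)
    return ' '.join(re.split(r'[._\-\s]+', raw)).strip()

def match_resource_to_movie(resource_name, movie_titles):
    # Return the first movie whose normalized title appears in the resource name.
    key = normalize_title(resource_name)
    for title in movie_titles:
        if normalize_title(title) in key:
            return title
    return None

print match_resource_to_movie(
    'The.Walking.Dead.Season.3.Episodes.1-3.HDTV-x264',
    ['The Walking Dead', 'Breaking Bad'])
# -> 'The Walking Dead'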

A classmate once emailed me wanting to pay money for my crawler's source code.
If I had actually sent it, he would have seen that my crawler is just a few hundred lines of code on a single sheet of A4 paper, and he would surely have said: what a rip-off!!! ......

People say this is the era of information explosion, so what gets compared now is whose data-mining skills are stronger.

Well, then here comes the question: for learning excavator (data) technology, which school is the strongest?
