Using Python to write a simple web crawler to grab video download resources

Source: Internet
Author: User
Tags: text processing

The first time I touched crawlers was in May this year, when I wrote a blog search engine. The crawler I used then was quite intelligent, at least far more sophisticated than the one used for the movie site!

Back to the topic of writing crawlers in Python.

Python has always been my primary scripting language, bar none.

Python's syntax is simple and flexible, and its standard library is powerful. It can be used as a calculator, for text encoding conversion, image processing, batch downloading, batch text processing, and so on. In short, I like it a lot, and the more I use it the handier it gets. A tool this good, I don't tell just anyone about it.

For more network programming tutorials, please search the web.

Because of its powerful string processing capabilities, plus modules like urllib2, cookielib, re, and threading, writing crawlers in Python is ridiculously easy. How easy?

I told a classmate that the several crawlers and scattered data-processing scripts I wrote for the movie site add up to no more than 1,000 lines of code in total, and the movie site itself is only about 150 lines. Since the crawler code lives on another 64-bit Hackintosh, I won't list it here; below is just the site code on the VPS, written with the Tornado web framework.

[user@vps movie_site]$ wc -l *.py template/*
156 msite.py
template/base.html
94 template/id.html
template/index.html
template/search.html

Below, I'll show the crawler's writing process directly.

The following content is for learning and exchange only, nothing more.

Take a certain bay's latest video downloads as an example. Its URL is

http://a piratebay.se/browse/200

Because this page has a lot of ads, only the main body content is shown here:

For a Python crawler, one line of code is enough to download this page's source.

The urllib2 library is used here.

>>> import urllib2
>>> html = urllib2.urlopen('http://a piratebay.se/browse/200').read()
>>> print 'size is', len(html)
size is 52977

Of course, you can also use the system function in the os module to invoke the wget command to download the page content; it is very convenient for those who have mastered the wget or curl tools. A rough sketch follows.
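A minimal sketch of that os.system approach (my addition, assuming wget is installed; the output filename page.html is just an illustration):

import os

# shell out to wget; -O names the local output file (page.html is a hypothetical choice)
os.system('wget "http://a piratebay.se/browse/200" -O page.html')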

Using Firebug to observe the structure of the web page, you can see that the body part of the HTML is a table.

Each resource is a tr tag.

For each resource, the information that needs to be extracted is:

1. Video category
2. Resource name
3. Resource link
4. Resource size
5. Upload time

That's all it takes for now; if there's a need, more can be added.

First, take a look at the HTML code inside a tr tag.

<tr>
  <td class="vertTh">
    <center>
      <a href="/browse/200" title="More from this category">Video</a><br/>
      (<a href="/browse/205" title="More from this category">TV</a>)
    </center>
  </td>
  <td>
    <div class="detName">
      <a href="/torrent/7782194/The_Walking_Dead_Season_3_Episodes_1-3_HDTV-x264" class="detLink" title="Details for The Walking Dead Season 3 Episodes 1-3 HDTV-x264">The Walking Dead Season 3 Episodes 1-3 HDTV-x264</a>
    </div>
    <a href="magnet:?xt=urn:btih:4f63d58e51c1a4a997c6f099b2b529bdbba72741&dn=The+Walking+Dead+Season+3+Episodes+1-3+HDTV-x264&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80&tr=udp%3A%2F%2Ftracker.publicbt.com%3A80&tr=udp%3A%2F%2Ftracker.istole.it%3A6969&tr=udp%3A%2F%2Ftracker.ccc.de%3A80" title="Download this torrent using magnet"></a>
    <a href="//torrents.piratebay.se/7782194/The_Walking_Dead_Season_3_Episodes_1-3_HDTV-x264.7782194.TPB.torrent" title="Download torrent"></a>
    <font class="detDesc">Uploaded <b>3 minutes ago</b>, Size 2 GiB, ULed by <a class="detDesc" href="/user/paridha/" title="Browse paridha">paridha</a></font>
  </td>
  <td align="right">0</td>
  <td align="right">0</td>
</tr>

Next, a regular expression is used to extract the content from the HTML code. Readers unfamiliar with regular expressions can visit http://docs.python.org/2/library/re.html to learn more.

There is a reason for using regular expressions rather than some other tool that parses the HTML or DOM tree. I tried using BeautifulSoup3 to extract the content earlier and found it painfully slow: processing 100 items per second was already the limit of my computer.

After switching to a regular expression, compiled before processing the content, it beat BeautifulSoup hands down on speed!
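If you want to see the gap yourself, here is a rough timing sketch (my addition, not from the original post); it assumes BeautifulSoup 3 is installed and that html holds the page source downloaded earlier:

import time
import re
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 import style

t = time.time()
rows = BeautifulSoup(html).findAll('tr')         # parse the DOM, then search for tags
print 'BeautifulSoup:', time.time() - t, 'seconds'

t = time.time()
row_re = re.compile(r'<tr>.+?</tr>', re.DOTALL)  # compile once, reuse many times
rows = row_re.findall(html)                      # scan the raw string directly
print 'regex:', time.time() - t, 'seconds'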

How do I write a regular expression to extract so much content?

In my experience, ".*?" and ".+?" work very well. Just pay attention to a few small issues; you'll notice them in actual use, as the example below shows.
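A quick illustration of why the non-greedy forms matter (my example, not from the original post): greedy ".+" swallows everything up to the last closing tag, while ".+?" stops at the first one.

>>> import re
>>> s = '<b>one</b><b>two</b>'
>>> re.findall(r'<b>(.+)</b>', s)
['one</b><b>two']
>>> re.findall(r'<b>(.+?)</b>', s)
['one', 'two']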

For the tr tag code above, my expression first needs to match the opening symbol

<tr>

at the beginning of the content. Of course, the anchor could be something else; just don't miss any of the required content.

Then I match the following to get the video category:

(<a href="/browse/205" title="More from this category">TV</a>)

Then I match the resource link:

<a href="..." class="detLink" title="...">...</a>

Then on to the other resource information:

<font class="detDesc">Uploaded <b>3 minutes ago</b>, Size 2 GiB, ULed by

Finally, match the closing

</tr>

Done.

Of course, the final </tr> match doesn't have to appear in the expression; as long as the starting anchor is positioned correctly, the positions of the information that follows will also be correct.

Friends who know regular expressions well can probably already see how to write it. Here is the expression I came up with:
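(This is the same find_re pattern used in the complete program below:)

find_re = re.compile(r'<tr>.+?\(.+?">(.+?)</a>.+?class="detLink".+?">(.+?)</a>.+?<a href="(magnet:.+?)".+?Uploaded <b>(.+?)</b>, Size (.+?),', re.DOTALL)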

That simple, and the results came out. I was quite pleased.

Of course, this crawler's design is targeted: it crawls the content of one specific site in a directed way. And there is no crawler that doesn't filter the links it collects.

Typically, you can use BFS (breadth-first search) to crawl all of a site's page links, as sketched below.
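A minimal BFS sketch of that idea (my addition; the start URL and the same-site filter pattern are illustrative assumptions):

# coding: utf8
import re
import urllib2
from collections import deque

start = 'http://example.com/'  # hypothetical start page
# keep only links that stay on the same site (illustrative pattern)
link_re = re.compile(r'href="(http://example\.com/[^"]+)"')

seen = set([start])
queue = deque([start])  # a FIFO queue gives breadth-first order
while queue:
    page = queue.popleft()
    try:
        html = urllib2.urlopen(page).read()
    except IOError:
        continue  # skip pages that fail to download
    for link in link_re.findall(html):
        if link not in seen:  # filter already-collected links
            seen.add(link)
            queue.append(link)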

Here is the complete Python crawler code for grabbing the latest 10 pages of video resources from a certain bay:

# coding: utf8
import urllib2
import re
import pymongo

db = pymongo.Connection().test
url = 'http://a piratebay.se/browse/200/%d/3'
find_re = re.compile(r'<tr>.+?\(.+?">(.+?)</a>.+?class="detLink".+?">(.+?)</a>.+?<a href="(magnet:.+?)".+?Uploaded <b>(.+?)</b>, Size (.+?),', re.DOTALL)

# directed crawl of the latest 10 pages of video resources
for i in range(0, 10):
    u = url % (i)
    # download the page
    html = urllib2.urlopen(u).read()
    # find the resource information
    for x in find_re.findall(html):
        values = dict(
            category = x[0],
            name = x[1],
            magnet = x[2],
            time = x[3],
            size = x[4]
        )
        # save to the database
        db.priate.save(values)

print 'Done!'

The code above is only meant to illustrate the idea. Actually running it requires a MongoDB database, and you may be unable to access the bay site, in which case you won't get normal results.
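If it does run, a quick sanity check is to query the collection back (a sketch of my own; the collection name priate follows the code above):

import pymongo

db = pymongo.Connection().test
print db.priate.count()     # how many resources were saved
print db.priate.find_one()  # peek at one saved document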

So the crawlers used by the movie site are not hard to write. What's hard is, after getting the data, how to organize it to extract useful information. For example, how to match a piece of film information with a resource, and how to establish associations between the film information library and the video links. This requires constantly trying various methods and finally choosing the more reliable ones.

A classmate once emailed me saying he'd even pay to get my crawler's source code. If I actually gave it to him, my crawler being just a few hundred lines of code, one sheet of A4 paper, wouldn't he cry, "What a rip-off!"?

......

Everyone says we're now in the era of the information explosion, so the contest comes down to whose data mining skills are stronger.

All right, so here comes the question: for (data) excavator skills, which school is the strongest?

Originally from: http://www.wangwenzl.cn

