Judging from the comments on the previous article, it seems that many readers are quite interested in the crawler source code. This article is a detailed record of how to write a simple web crawler in Python to grab video download resources; I hope it helps you. I first came into contact with crawlers in February of this year, when I wrote a blog search engine. The crawler it used was fairly intelligent, at least quite a bit more advanced than the crawler used by this movie site!
Back to the topic of crawling with Python.
Python has always been one of my main scripting languages. The language is concise and flexible, and the standard library is powerful: it can serve as a calculator, handle text encoding conversion, image processing, batch downloading, batch text processing, and so on. In a word, I like it very much, and the more I use it the more comfortable it feels. Such a useful tool, I wouldn't tell just anyone about it...
Because of its powerful string processing capabilities, and the existence of the urllib2, cookielib, re, and threading modules, writing a crawler in Python is extremely easy. How easy? I told a classmate that the crawlers I wrote for the movie site plus all the scattered scripts for data sorting come to fewer than 1000 lines of code in total, and the website itself is only about 150 lines. Because the crawler code sits on another 64-bit Hackintosh, it is not listed here; only the website code on the VPS is listed, written with the tornado web framework.
[xiaoxia@307232 movie_site]$ wc -l *.py template/*
156 msite.py
92 template/base.html
79 template/category.html
94 template/id.html
47 template/index.html
77 template/search.html
The following records the process of writing the crawler. The content below is for learning and exchange only.
Take the latest video download resources from a certain bay as an example. Its URL is
http://somepiratebay.se/browse/200
Because the webpage contains a large number of advertisements, only the main body content is of interest here.
For a Python crawler, downloading the source code of this page takes just one line of code. The urllib2 library is used here.
>>> import urllib2
>>> html = urllib2.urlopen('http://somepiratebay.se/browse/100').read()
>>> print 'size is', len(html)
size is 52977
Of course, you can also use the system() function in the os module to call the wget command to download webpage content; this is very convenient for those who have already mastered the wget or curl tools.
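For example, a minimal sketch (reusing the obscured URL from above, and saving to a hypothetical page.html):

>>> import os
>>> os.system('wget "http://somepiratebay.se/browse/200" -O page.html')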
Using Firebug to observe the webpage structure, you can see that the body html is a table, and each resource is a tr tag.
For each resource, the following information needs to be extracted:
1. Video category
2. Resource name
3. Resource link
4. Resource size
5. Upload time

That is enough; if necessary, more can be added.
First, extract the code of a single tr tag to observe it (simplified, with the tag structure abbreviated):

<tr>
  <td><a href="/browse/200">Video</a> (<a href="...">TV</a>)</td>
  <td>
    <a href="/torrent/..." class="detLink" title="...">The Walking Dead Season 3 Episodes 1-3 HDTV-x264</a>
    <a href="magnet:?xt=urn:btih:...">Magnet link</a>
    <font class="detDesc">Uploaded 3 minutes ago, Size 2 GiB, uploaded by paridha</font>
  </td>
  <td>0</td>
  <td>0</td>
</tr>
The following uses a regular expression to extract the content from the html code. Students who do not understand regular expressions can go to http://docs.python.org/2/library/re.html to learn more.

There is a reason for using regular expressions rather than other tools that parse HTML or the DOM tree. I tried using BeautifulSoup3 to extract the content and found it was really slow; processing 100 entries per second was already the limit on my computer... After switching to compiled regular expressions, the same content was processed in an instant!

With so much content to extract, how do I write the regular expression?

Based on my past experience, ".*?" or ".+?" is a good thing. However, there are a few small issues to watch out for; you will discover them as you use them.
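For instance, a quick sketch of the difference between greedy and non-greedy matching (the sample string is made up for illustration):

>>> import re
>>> s = '<td>A</td><td>B</td>'
>>> re.findall(r'<td>(.*)</td>', s)   # greedy: swallows everything up to the last </td>
['A</td><td>B']
>>> re.findall(r'<td>(.*?)</td>', s)  # non-greedy: stops at the first </td>
['A', 'B']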
For the tr tag code above, the first thing I need my expression to match is <tr>, which marks the beginning of the content. Of course, it could also be something else, as long as the content you need is not missed. Then I want to match the following content, to get the video category:

(<a href="...">TV</a>)
Then I want to match the resource link:

<a href="..." class="detLink" title="...">...</a>
Then on to the other resource information:

<font class="detDesc">Uploaded 3 minutes ago, Size 2 GiB, uploaded ...

Finally, match </tr>. Success!
Of course, the final match does not need to be written into the regular expression; as long as the starting position is located correctly, the positions used to obtain the subsequent information will also be correct.
Friends who have used regular expressions probably already know how to write it. Let me show the expression I came up with:
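A sketch of the final expression, pieced together from the matching steps above (the exact tag details are assumptions based on the simplified snippet shown earlier):

find_re = re.compile(r'<tr>.+?\(.+?">(.+?)</a>.+?class="detLink".+?">(.+?)</a>.+?<a href="(magnet:.+?)".+?Uploaded (.+?), Size (.+?),', re.DOTALL)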
It's that simple. Seeing the results come out, I feel pretty good about it.
Of course, a crawler designed this way is targeted; it only crawls the content of one particular site. And no crawler skips filtering the links it collects. Generally, you can use BFS (breadth-first search) to crawl all the page links of a website, as sketched below.
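A minimal BFS link crawler might look like the following; this is only an illustration, and the starting URL and link pattern are placeholders:

import re
import urllib2
from collections import deque

start_url = 'http://example.com/'  # placeholder starting point
# only follow links on the same site
link_re = re.compile(r'href="(http://example\.com/[^"]*)"')

visited = set([start_url])
queue = deque([start_url])

while queue:
    url = queue.popleft()
    try:
        html = urllib2.urlopen(url).read()
    except Exception:
        continue
    # ... extract whatever information is needed from html here ...
    for link in link_re.findall(html):
        if link not in visited:  # filter out links already seen
            visited.add(link)
            queue.append(link)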
The complete Python crawler code, which grabs the latest 10 pages of video resources from the bay:
# coding: utf8
import urllib2
import re
import pymongo

db = pymongo.Connection().test
url = 'http://somepiratebay.se/browse/200/%d/3'
# NOTE: the tag details in this pattern follow the simplified snippet above
find_re = re.compile(r'<tr>.+?\(.+?">(.+?)</a>.+?class="detLink".+?">(.+?)</a>.+?<a href="(magnet:.+?)".+?Uploaded (.+?), Size (.+?),', re.DOTALL)

# Targeted crawling of the latest 10 pages of video resources
for i in range(0, 10):
    u = url % (i)
    # Download the page
    html = urllib2.urlopen(u).read()
    # Find the resource information
    for x in find_re.findall(html):
        values = dict(
            category = x[0],
            name = x[1],
            magnet = x[2],
            time = x[3],
            size = x[4]
        )
        # Save to the database
        db.priate.save(values)

print 'Done!'
The above code is only meant to demonstrate the idea. It relies on a mongodb database in actual operation, and it may not produce normal results because the bay's website may not be accessible.
So, writing a crawler for a movie website is not difficult; what is difficult is sorting the data and extracting useful information after you have it. For example, how to match each piece of video metadata with a resource, and how to establish associations between the video information library and the video links, requires trying all kinds of approaches and finally picking one that is fairly reliable.
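For example, one simple approach (just a sketch, not the matching logic actually used on the site) is to normalize resource names down to a bare title before looking them up in the video information library:

import re

def normalize_title(name):
    # strip common release tags, assuming names like
    # 'The Walking Dead Season 3 Episodes 1-3 HDTV-x264'
    name = name.lower()
    name = re.sub(r'(hdtv|x264|720p|1080p|dvdrip|bluray)\S*', '', name)
    name = re.sub(r'season\s*\d+.*$', '', name)  # drop season/episode info
    return ' '.join(name.split())

print normalize_title('The Walking Dead Season 3 Episodes 1-3 HDTV-x264')
# -> 'the walking dead'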
Someone even sent me an email wanting to pay money for my crawler's source code. If I really gave it to them, my crawler being just a few hundred lines of code on a single sheet of A4 paper, wouldn't they say: what a rip-off!!!...
It is said that we are now in the age of information explosion, so the competition comes down to who can mine the data better.

Well then, the question is: whose excavator (data mining) technology is the best?