I have always loved watching American TV shows, partly to practice my English listening and partly just to pass the time. You used to be able to watch them on the big video sites, but since SARFT restricted imported American and British shows, they no longer update in sync with the original broadcasts. As a dedicated homebody I wasn't about to go without my shows, so I poked around online and found a download site that works with Thunder (Xunlei), "Every Day American Drama", where all kinds of resources are there for the taking. Lately I've been hooked on the BBC's high-definition documentaries; the nature footage is just stunning.
Even with a resource site, though, every time I wanted something I had to open the browser, type in the address, find the show, and then click through to the download link. Over time the routine got tedious, and sometimes the site's links simply wouldn't open, which was one more annoyance. I happen to have been learning Python web crawling, so today, on a whim, I wrote a crawler that grabs the download links for every show on the site and saves them to text files; whenever I want a show, I just open the file and copy its link into Thunder to download it.
My first attempt was to pick a starting URL, fetch pages with requests, extract the download links, and crawl the entire site outward from the homepage. But there were lots of duplicate links, and the site's URLs didn't follow the pattern I expected; after half a day I still hadn't managed to write the kind of fan-out crawler I had in mind. Maybe my skills just aren't there yet, so I'll keep working at it.
Later I noticed that the show links all live inside the article pages, and each article URL ends in a number, for example http://cn163.net/archives/24016/. So, drawing on my earlier crawler-writing experience, I decided to generate the URLs automatically: the trailing number never changes for a given article and is unique to each show, so I only had to estimate roughly how many articles there are and then use the range function to generate the numbers and build the URLs.
Many of the generated URLs don't exist, of course, and requesting them would trip the crawler up. No matter: requests exposes the HTTP status of each response through status_code, so whenever a request comes back 404 we simply skip that URL and keep crawling links from the rest. That takes care of the URL problem.
Here is the code that implements the steps above:
def get_urls(self):
    try:
        # probe every candidate article id and hand the pages that exist to save_links
        for i in range(2015, 25000):
            base_url = 'http://cn163.net/archives/'
            url = base_url + str(i) + '/'
            if requests.get(url).status_code == 404:
                # the page does not exist, skip this id
                continue
            else:
                self.save_links(url)
    except Exception, e:
        pass
The rest went smoothly. I found a similar crawler that someone had written before, but it only crawled a single article, so I borrowed its regular expression. I had tried BeautifulSoup first and couldn't get good results out of it, so I dropped it decisively; there is always more to learn. Even so, the outcome isn't ideal: roughly half of the links still aren't extracted correctly, so the regex needs more tuning.
# -*- coding: utf-8 -*-
import requests
import re
import sys
import threading
import time

reload(sys)
sys.setdefaultencoding('utf-8')

class Archives(object):

    def save_links(self, url):
        try:
            data = requests.get(url, timeout=3)
            content = data.text
            link_pat = '"(ed2k://\|file\|[^"]+?\.(s\d+)(e\d+)[^"]+?1024x\d{3}[^"]+?)"'
The code above is the full version, and it uses multiple threads, though that doesn't buy much because of Python's GIL. The site seems to have more than 20,000 article pages, so I expected the crawl to take ages, but once the nonexistent URLs and the pages with no matching links are excluded, the whole crawl finished in about 20 minutes. I had originally planned to use Redis and run the crawl across two Linux machines, but after tinkering for a while it didn't seem necessary, so I'll leave that for later, when I want more data.
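For reference, here is one way the id range could be split across a few worker threads. This is my own sketch rather than the code from the post, and it assumes a get_urls_range variant of get_urls that takes a start and end id. Since the crawler spends almost all of its time waiting on HTTP responses, threads can still overlap that waiting even with the GIL in place.

    # inside the Archives class defined above
    def get_urls_range(self, start, end):
        # same probing loop as get_urls, limited to article ids in [start, end)
        for i in range(start, end):
            url = 'http://cn163.net/archives/%d/' % i
            try:
                if requests.get(url, timeout=3).status_code != 404:
                    self.save_links(url)
            except Exception:
                continue  # skip ids that time out or otherwise fail

    def main(self):
        # split the id range over four worker threads
        threads = []
        step = (25000 - 2015) // 4 + 1
        for begin in range(2015, 25000, step):
            t = threading.Thread(target=self.get_urls_range,
                                 args=(begin, min(begin + step, 25000)))
            t.start()
            threads.append(t)
        for t in threads:
            t.join()

if __name__ == '__main__':
    start = time.time()
    Archives().main()
    print 'Crawl finished in', time.time() - start, 'seconds'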
The one thing that really tortured me along the way was saving the files, and I have to complain about it: a txt file name may contain spaces, but not slashes, backslashes, or a handful of other special characters. That was the entire problem. I spent the whole morning on it; at first I assumed the crawled data was wrong, and only after half a day of checking did I realize that some of the crawled show titles contain a slash, which is what had been tripping me up.
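For anyone who hits the same thing, a small helper along these lines (my own sketch, not code from the post) replaces the characters that most commonly break file creation; on Linux only the forward slash is actually forbidden in a file name, but Windows rejects quite a few more.

import re

def safe_filename(title):
    # swap out characters that are invalid in Windows file names
    # (on Linux only '/' is forbidden) and trim stray whitespace
    return re.sub(r'[\\/:*?"<>|]', ' ', title).strip()

# e.g. safe_filename('Show/Name S01E01') -> 'Show Name S01E01'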
The author of this article: Code Farmer Network – Schohau