I have always loved watching American TV shows, partly to practice my English listening and partly just to pass the time. You used to be able to watch them on the big video sites, but since SARFT restricted imported American and British shows, they no longer update in sync with the original broadcasts. As a dedicated homebody I wasn't about to go without my shows, so I poked around online and found a download site that works with Thunder (Xunlei), "Every Day American Drama", where all kinds of resources are there for the taking. Lately I've been hooked on the BBC's high-definition documentaries; the nature footage is just stunning.
Even with a resource site, though, every time I wanted something I had to open the browser, type in the address, find the show, and then click through to the download link. Over time the routine got tedious, and sometimes the site's links simply wouldn't open, which was one more annoyance. I happen to have been learning Python web crawling, so today, on a whim, I wrote a crawler that grabs the download links for every show on the site and saves them to text files; whenever I want a show, I just open the file and copy its link into Thunder to download it.
My first attempt was to pick a starting URL, fetch pages with requests, extract the download links, and crawl the entire site outward from the homepage. But there were lots of duplicate links, and the site's URLs didn't follow the pattern I expected; after half a day I still hadn't managed to write the kind of fan-out crawler I had in mind. Maybe my skills just aren't there yet, so I'll keep working at it.
Later I noticed that the show links all live inside the article pages, and each article URL ends in a number, for example http://cn163.net/archives/24016/. So, drawing on my earlier crawler-writing experience, I decided to generate the URLs automatically: the trailing number never changes for a given article and is unique to each show, so I only had to estimate roughly how many articles there are and then use the range function to generate the numbers and build the URLs.
Many of the generated URLs don't exist, of course, and requesting them would trip the crawler up. No matter: requests exposes the HTTP status of each response through status_code, so whenever a request comes back 404 we simply skip that URL and keep crawling links from the rest. That takes care of the URL problem.
Here is the code that implements the steps above:
def get_urls(self):
    try:
        # probe every candidate article id and hand the pages that exist to save_links
        for i in range(2015, 25000):
            base_url = 'http://cn163.net/archives/'
            url = base_url + str(i) + '/'
            if requests.get(url).status_code == 404:
                # the page does not exist, skip this id
                continue
            else:
                self.save_links(url)
    except Exception, e:
        pass
The rest went smoothly. I found a similar crawler that someone had written before, but it only crawled a single article, so I borrowed its regular expression. I had tried BeautifulSoup first and couldn't get good results out of it, so I dropped it decisively; there is always more to learn. Even so, the outcome isn't ideal: roughly half of the links still aren't extracted correctly, so the regex needs more tuning.
# -*- coding: utf-8 -*-
import requests
import re
import sys
import threading
import time

reload(sys)
sys.setdefaultencoding('utf-8')

class Archives(object):

    def save_links(self, url):
        try:
            data = requests.get(url, timeout=3)
            content = data.text
            link_pat = '"(ed2k://\|file\|[^"]+?\.(s\d+)(e\d+)[^"]+?1024x\d{3}[^"]+?)"'
The code above is the full version, and it uses multiple threads, though that doesn't buy much because of Python's GIL. The site seems to have more than 20,000 article pages, so I expected the crawl to take ages, but once the nonexistent URLs and the pages with no matching links are excluded, the whole crawl finished in about 20 minutes. I had originally planned to use Redis and run the crawl across two Linux machines, but after tinkering for a while it didn't seem necessary, so I'll leave that for later, when I want more data.
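For reference, here is one way the id range could be split across a few worker threads. This is my own sketch rather than the code from the post, and it assumes a get_urls_range variant of get_urls that takes a start and end id. Since the crawler spends almost all of its time waiting on HTTP responses, threads can still overlap that waiting even with the GIL in place.

    # inside the Archives class defined above
    def get_urls_range(self, start, end):
        # same probing loop as get_urls, limited to article ids in [start, end)
        for i in range(start, end):
            url = 'http://cn163.net/archives/%d/' % i
            try:
                if requests.get(url, timeout=3).status_code != 404:
                    self.save_links(url)
            except Exception:
                continue  # skip ids that time out or otherwise fail

    def main(self):
        # split the id range over four worker threads
        threads = []
        step = (25000 - 2015) // 4 + 1
        for begin in range(2015, 25000, step):
            t = threading.Thread(target=self.get_urls_range,
                                 args=(begin, min(begin + step, 25000)))
            t.start()
            threads.append(t)
        for t in threads:
            t.join()

if __name__ == '__main__':
    start = time.time()
    Archives().main()
    print 'Crawl finished in', time.time() - start, 'seconds'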
The one thing that really tortured me along the way was saving the files, and I have to complain about it: a txt file name may contain spaces, but not slashes, backslashes, or a handful of other special characters. That was the entire problem. I spent the whole morning on it; at first I assumed the crawled data was wrong, and only after half a day of checking did I realize that some of the crawled show titles contain a slash, which is what had been tripping me up.
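For anyone who hits the same thing, a small helper along these lines (my own sketch, not code from the post) replaces the characters that most commonly break file creation; on Linux only the forward slash is actually forbidden in a file name, but Windows rejects quite a few more.

import re

def safe_filename(title):
    # swap out characters that are invalid in Windows file names
    # (on Linux only '/' is forbidden) and trim stray whitespace
    return re.sub(r'[\\/:*?"<>|]', ' ', title).strip()

# e.g. safe_filename('Show/Name S01E01') -> 'Show Name S01E01'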
The author of this article: Code Farmer Network – Schohau