I like watching American TV dramas, especially the HR-HDTV 1024-resolution releases with dual-language subtitles from YYeTs ("everyone movie"). So I wrote a script that fetches all the HR-HDTV ed2k download links for a given show and writes them to a text file, which a download tool can then import for batch downloading, for example Thunder (Xunlei). Start Thunder, then copy all the download links to the clipboard; Thunder monitors the clipboard and offers to create a new batch task. If Thunder's clipboard monitoring is not active, click New Task and paste the links. The Python 3 source code is as follows:
# Python 3 implementation; crawling the three US dramas below takes about ten seconds
import urllib.request
import re

def get_links(url, name='yyets'):
    data = urllib.request.urlopen(url).read().decode()
    # match ed2k links of the HR-HDTV 1024-wide releases;
    # capture the season (s\d+) and episode (e\d+) tags for sorting
    pattern = r'"(ed2k://\|file\|[^"]+?\.(s\d+)(e\d+)[^"]+?1024x\d{3}[^"]+?)"'
    linksfind = set(re.findall(pattern, data))
    linksdict = {}
    total = len(linksfind)
    for i in linksfind:
        # key = season * 100 + episode, so the links sort in broadcast order
        linksdict[int(i[1][1:3]) * 100 + int(i[2][1:3])] = i
    with open(name + '.txt', 'w') as f:
        for i in sorted(list(linksdict.keys())):
            f.write(linksdict[i][0] + '\n')
            print(linksdict[i][0])
        print("Get download links of:", name, str(total))

if __name__ == '__main__':
    # ---------- Prison Break, Shameless, Game of Thrones ----------
    get_links('http://www.yyets.com/resource/10004', 'prison_break')
    get_links('http://www.yyets.com/resource/10760', 'shameless')
    get_links('http://www.yyets.com/resource/d10733', 'game_of_thrones')
    print('All is okay!')
This crawler is quite short, using only two modules: urllib.request, which fetches the web page, and re, which parses the text.
YYeTs does not restrict crawler access, so there is no need to change the HTTP User-Agent header; but for sites that block crawlers, you do need to change the User-Agent value. The procedure is as follows: construct a Request object with the urllib.request.Request constructor, putting your own User-Agent value into its headers dictionary, then pass that object to the module's urlopen(); this disguises the crawler as a browser. For example, CSDN blocks crawlers, so you need to change the User-Agent, like this:
import urllib.request

url = 'http://blog.csdn.net/csdn'
head = {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)'}
req = urllib.request.Request(url, headers=head)
data = urllib.request.urlopen(req, timeout=2).read().decode()
print(data)
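As a side note, here is a minimal sketch (my own addition, not from the original post) of the same request with basic error handling: when a site rejects a crawler, urlopen typically raises urllib.error.HTTPError (often a 403), which can be caught like this:

# Sketch: same request with error handling; a blocked crawler usually
# surfaces as urllib.error.HTTPError (often 403 Forbidden).
import urllib.request
import urllib.error

url = 'http://blog.csdn.net/csdn'
head = {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)'}
try:
    req = urllib.request.Request(url, headers=head)
    data = urllib.request.urlopen(req, timeout=2).read().decode()
    print(data)
except urllib.error.HTTPError as e:
    print('HTTP error:', e.code)
except urllib.error.URLError as e:
    print('Failed to reach the server:', e.reason)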
After crawling a page comes parsing the HTML document. The re module used here is convenient for extracting one specific piece of content; for more complex parsing, use pyquery or Beautiful Soup, which are HTML/XML parsers written in Python. Of the two, pyquery offers a jQuery-style API, which is more convenient if you already know jQuery.
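To make this concrete, here is a minimal sketch of the same link extraction done with Beautiful Soup instead of a regex. This is my own illustration, not code from the original post; it assumes the page exposes each ed2k link as the href of an <a> tag and that beautifulsoup4 is installed:

# Sketch: extract ed2k links with Beautiful Soup (pip install beautifulsoup4).
# Assumes each link appears as the href attribute of an <a> tag.
import urllib.request
from bs4 import BeautifulSoup

def get_ed2k_links(url):
    data = urllib.request.urlopen(url).read().decode()
    soup = BeautifulSoup(data, 'html.parser')
    links = []
    for a in soup.find_all('a', href=True):
        if a['href'].startswith('ed2k://'):
            links.append(a['href'])
    return links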
As for regular expressions, RegexBuddy is a tool with powerful regex debugging features; I used it to debug the regex in the script above. This blog post on Python regular expressions is also very good: the Python regular expression quick-start guide.
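For readability, the ed2k pattern from the script above can also be written with re.VERBOSE, which lets you comment each part inline. This restructured form is my own (equivalent to the original pattern) and is easier to debug:

import re

# The same ed2k pattern, restructured with re.VERBOSE for readability.
pattern = re.compile(r'''
    "(                    # group 1: the whole ed2k link
    ed2k://\|file\|       # ed2k URI prefix (pipes escaped)
    [^"]+?                # file name up to the season/episode tag
    \.(s\d+)(e\d+)        # groups 2 and 3: season and episode, e.g. .s03e05
    [^"]+?
    1024x\d{3}            # the HR-HDTV 1024-wide resolution marker
    [^"]+?
    )"
''', re.VERBOSE)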
To make the crawler more powerful, you can use the crawler framework Scrapy; here is the official Scrapy tutorial. Furthermore, if the page content is generated by JavaScript, you will need a JS engine; PyV8 is worth a try. There are also JS-based crawler tools, such as CasperJS and PhantomJS.
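For instance, a minimal Scrapy spider for the same task might look like the sketch below. This is my own assumption-laden illustration, not code from the post; the start URL and the href-based extraction are placeholders to adapt to the actual page structure:

# Sketch: a minimal Scrapy spider that collects ed2k links from a resource page.
import scrapy

class Ed2kSpider(scrapy.Spider):
    name = 'ed2k'
    start_urls = ['http://www.yyets.com/resource/10004']  # Prison Break page, as in the script above

    def parse(self, response):
        # yield every href attribute that is an ed2k link
        for link in response.xpath('//a/@href').getall():
            if link.startswith('ed2k://'):
                yield {'link': link}

A spider like this can be run without a full Scrapy project via: scrapy runspider ed2k_spider.py -o links.json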
"Address: http://blog.csdn.net/thisinnocence/article/details/39997883"
Python crawler: batch-download American dramas (HR-HDTV) from YYeTs ("everyone movie")