Python Crawler: Bulk-Download American Dramas in HR-HDTV from YYeTs

Source: Internet
Author: User
Tags: xml parser

I like watching American dramas, especially the HD bilingual-subtitle HR-HDTV 1024-resolution releases from YYeTs. Here is a script that fetches all the HR-HDTV ed2k download links for a given show and writes them to a text file, so that a download tool can batch-download them. Take Thunder (Xunlei) for example: start Thunder, then copy all the download links to the clipboard; Thunder monitors the clipboard and creates a new batch task. If Thunder's clipboard monitoring is not active, click New Task and paste the links. The Python source code is as follows, written for Python 3:

# Python 3 implementation; the example below crawls three American dramas in about ten seconds
import urllib.request
import re

def get_links(url, name='yyets'):
    data = urllib.request.urlopen(url).read().decode()
    pattern = r'"(ed2k://\|file\|[^"]+?\.(s\d+)(e\d+)[^"]+?1024[xX]\d{3}[^"]+?)"'
    links_find = set(re.findall(pattern, data))
    links_dict = {}
    total = len(links_find)
    for i in links_find:
        # sort key: season * 100 + episode, so s02e05 becomes 205
        links_dict[int(i[1][1:3]) * 100 + int(i[2][1:3])] = i
    with open(name + '.txt', 'w') as f:
        for key in sorted(links_dict.keys()):
            f.write(links_dict[key][0] + '\n')
            print(links_dict[key][0])
    print("Got download links of:", name, str(total))

if __name__ == '__main__':
    # ---------- Prison Break, Shameless, Game of Thrones ----------
    get_links('http://www.yyets.com/resource/10004', 'prison_break')
    get_links('http://www.yyets.com/resource/10760', 'shameless')
    get_links('http://www.yyets.com/resource/10733', 'game_of_thrones')
    print('All is okay!')
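To hand the saved links to a clipboard-watching download tool in one step, a small helper like the following can be used. This is a minimal sketch, assuming the third-party pyperclip package is installed (pip install pyperclip) and that the .txt file produced above already exists; Thunder then picks the links up from the clipboard as a batch task.

# Minimal sketch: copy the saved links to the clipboard for Thunder to pick up.
# Assumes the third-party pyperclip package is installed (pip install pyperclip).
import pyperclip

with open('prison_break.txt') as f:
    pyperclip.copy(f.read())
print('Links copied to clipboard.')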

This Python crawler is quite short. It uses only two modules: urllib.request, which fetches the web page, and re, which parses the text.

YYeTs places no restrictions on crawler access, so there is no need to change the User-Agent in the HTTP headers. For websites that do block crawlers, you need to change the User-Agent value. The procedure is as follows: construct a request object with the Request class constructor in urllib.request, set the User-Agent entry in the headers dictionary, and then pass the object to the module's urlopen function; this disguises the crawler as a browser. For example, CSDN blocks crawlers, so you need to change the User-Agent value, as follows:

import urllib.request

url = 'http://blog.csdn.net/csdn'
head = {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)'}
req = urllib.request.Request(url, headers=head)
data = urllib.request.urlopen(req, timeout=2).read().decode()
print(data)

After crawling a page comes parsing the HTML document. The regular expression used here is convenient for extracting a single specific piece of content. For more complex parsing, use pyquery or Beautiful Soup, both HTML/XML parsers written in Python. Of the two, pyquery offers a jQuery-style API, which is quite handy.
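As an illustration of the parser route, here is a minimal sketch of the same extraction done with Beautiful Soup instead of a regular expression. It assumes the bs4 package is installed and that the page exposes the ed2k links as href attributes of anchor tags, which may not match the real yyets.com markup:

# Minimal sketch: extract ed2k links with Beautiful Soup (pip install beautifulsoup4).
# Assumes the links appear as href attributes of <a> tags, which is illustrative only.
import urllib.request
from bs4 import BeautifulSoup

data = urllib.request.urlopen('http://www.yyets.com/resource/10004').read().decode()
soup = BeautifulSoup(data, 'html.parser')
ed2k_links = [a['href'] for a in soup.find_all('a', href=True)
              if a['href'].startswith('ed2k://')]
for link in ed2k_links:
    print(link)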

As for regular expressions, RegexBuddy is a tool with powerful regex-debugging features; it was used to debug the regular expression in the script above. This blog post on Python regular expressions is also very good: the Python Regular Expression Guide.
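If RegexBuddy is not at hand, the pattern can also be checked directly in Python against a hand-written sample link; the file name below is made up purely for the test:

# Self-contained check of the season/episode pattern, run against a sample link.
import re

pattern = r'"(ed2k://\|file\|[^"]+?\.(s\d+)(e\d+)[^"]+?1024[xX]\d{3}[^"]+?)"'
sample = '"ed2k://|file|Prison.Break.s01e01.HR-HDTV.1024X576.mkv|12345|abcdef|/"'
m = re.search(pattern, sample)
if m:
    link, season, episode = m.groups()
    print(season, episode)                           # -> s01 e01
    print(int(season[1:]) * 100 + int(episode[1:]))  # sort key -> 101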

To further extend the crawler, you can use the crawler framework Scrapy; see the official Scrapy tutorial, and the sketch below for a taste. What's more, if the page content is generated by JavaScript, a JS engine is needed; PyV8 is worth a try. There are also crawlers based on JS itself, such as CasperJS and PhantomJS.
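A minimal Scrapy spider for the same task might look like the following sketch; the start URL and the assumption that the ed2k links sit in plain href attributes are illustrative only, not taken from the real yyets.com markup:

# Minimal Scrapy spider sketch (run with: scrapy runspider ed2k_spider.py -o links.json).
import scrapy

class Ed2kSpider(scrapy.Spider):
    name = 'ed2k'
    start_urls = ['http://www.yyets.com/resource/10004']

    def parse(self, response):
        # collect every ed2k:// link found on the page
        for href in response.css('a::attr(href)').getall():
            if href.startswith('ed2k://'):
                yield {'link': href}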

"Address: http://blog.csdn.net/thisinnocence/article/details/39997883"

