I like watching American TV dramas, especially the HR-HDTV 1024-resolution releases with dual-language subtitles from YYeTs ("everyone movie"). So I wrote a script that fetches all the HR-HDTV ed2k download links for a given show and writes them to a text file, which a download tool can then import for batch downloading, for example Thunder (Xunlei). Start Thunder, then copy all the download links to the clipboard; Thunder monitors the clipboard and offers to create a new batch task. If Thunder's clipboard monitoring is not active, click New Task and paste the links. The Python 3 source code is as follows:
# Python 3 implementation; crawling the three US dramas below takes about ten seconds
import urllib.request
import re

def get_links(url, name='yyets'):
    data = urllib.request.urlopen(url).read().decode()
    # match ed2k links of the HR-HDTV 1024-wide releases;
    # capture the season (s\d+) and episode (e\d+) tags for sorting
    pattern = r'"(ed2k://\|file\|[^"]+?\.(s\d+)(e\d+)[^"]+?1024x\d{3}[^"]+?)"'
    linksfind = set(re.findall(pattern, data))
    linksdict = {}
    total = len(linksfind)
    for i in linksfind:
        # key = season * 100 + episode, so the links sort in broadcast order
        linksdict[int(i[1][1:3]) * 100 + int(i[2][1:3])] = i
    with open(name + '.txt', 'w') as f:
        for i in sorted(list(linksdict.keys())):
            f.write(linksdict[i][0] + '\n')
            print(linksdict[i][0])
        print("Get download links of:", name, str(total))

if __name__ == '__main__':
    # ---------- Prison Break, Shameless, Game of Thrones ----------
    get_links('http://www.yyets.com/resource/10004', 'prison_break')
    get_links('http://www.yyets.com/resource/10760', 'shameless')
    get_links('http://www.yyets.com/resource/d10733', 'game_of_thrones')
    print('All is okay!')
This crawler is quite short, using only two modules: urllib.request, which fetches the web page, and re, which parses the text.
YYeTs does not restrict crawler access, so there is no need to change the HTTP User-Agent header; but for sites that block crawlers, you do need to change the User-Agent value. The procedure is as follows: construct a Request object with the urllib.request.Request constructor, putting your own User-Agent value into its headers dictionary, then pass that object to the module's urlopen(); this disguises the crawler as a browser. For example, CSDN blocks crawlers, so you need to change the User-Agent, like this:
import urllib.request

url = 'http://blog.csdn.net/csdn'
head = {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)'}
req = urllib.request.Request(url, headers=head)
data = urllib.request.urlopen(req, timeout=2).read().decode()
print(data)
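As a side note, here is a minimal sketch (my own addition, not from the original post) of the same request with basic error handling: when a site rejects a crawler, urlopen typically raises urllib.error.HTTPError (often a 403), which can be caught like this:

# Sketch: same request with error handling; a blocked crawler usually
# surfaces as urllib.error.HTTPError (often 403 Forbidden).
import urllib.request
import urllib.error

url = 'http://blog.csdn.net/csdn'
head = {'User-Agent': 'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)'}
try:
    req = urllib.request.Request(url, headers=head)
    data = urllib.request.urlopen(req, timeout=2).read().decode()
    print(data)
except urllib.error.HTTPError as e:
    print('HTTP error:', e.code)
except urllib.error.URLError as e:
    print('Failed to reach the server:', e.reason)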
After crawling a page comes parsing the HTML document. The re module used here is convenient for extracting one specific piece of content; for more complex parsing, use pyquery or Beautiful Soup, which are HTML/XML parsers written in Python. Of the two, pyquery offers a jQuery-style API, which is more convenient if you already know jQuery.
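To make this concrete, here is a minimal sketch of the same link extraction done with Beautiful Soup instead of a regex. This is my own illustration, not code from the original post; it assumes the page exposes each ed2k link as the href of an <a> tag and that beautifulsoup4 is installed:

# Sketch: extract ed2k links with Beautiful Soup (pip install beautifulsoup4).
# Assumes each link appears as the href attribute of an <a> tag.
import urllib.request
from bs4 import BeautifulSoup

def get_ed2k_links(url):
    data = urllib.request.urlopen(url).read().decode()
    soup = BeautifulSoup(data, 'html.parser')
    links = []
    for a in soup.find_all('a', href=True):
        if a['href'].startswith('ed2k://'):
            links.append(a['href'])
    return links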
As for regular expressions, RegexBuddy is a tool with powerful regex debugging features; I used it to debug the regex in the script above. This blog post on Python regular expressions is also very good: the Python regular expression quick-start guide.
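For readability, the ed2k pattern from the script above can also be written with re.VERBOSE, which lets you comment each part inline. This restructured form is my own (equivalent to the original pattern) and is easier to debug:

import re

# The same ed2k pattern, restructured with re.VERBOSE for readability.
pattern = re.compile(r'''
    "(                    # group 1: the whole ed2k link
    ed2k://\|file\|       # ed2k URI prefix (pipes escaped)
    [^"]+?                # file name up to the season/episode tag
    \.(s\d+)(e\d+)        # groups 2 and 3: season and episode, e.g. .s03e05
    [^"]+?
    1024x\d{3}            # the HR-HDTV 1024-wide resolution marker
    [^"]+?
    )"
''', re.VERBOSE)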
To make the crawler more powerful, you can use the crawler framework Scrapy; here is the official Scrapy tutorial. Furthermore, if the page content is generated by JavaScript, you will need a JS engine; PyV8 is worth a try. There are also JS-based crawler tools, such as CasperJS and PhantomJS.
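For instance, a minimal Scrapy spider for the same task might look like the sketch below. This is my own assumption-laden illustration, not code from the post; the start URL and the href-based extraction are placeholders to adapt to the actual page structure:

# Sketch: a minimal Scrapy spider that collects ed2k links from a resource page.
import scrapy

class Ed2kSpider(scrapy.Spider):
    name = 'ed2k'
    start_urls = ['http://www.yyets.com/resource/10004']  # Prison Break page, as in the script above

    def parse(self, response):
        # yield every href attribute that is an ed2k link
        for link in response.xpath('//a/@href').getall():
            if link.startswith('ed2k://'):
                yield {'link': link}

A spider like this can be run without a full Scrapy project via: scrapy runspider ed2k_spider.py -o links.json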
"Address: http://blog.csdn.net/thisinnocence/article/details/39997883"
Python crawler: batch-download American dramas (HR-HDTV) from YYeTs ("everyone movie")