I recently took some time to learn Python and wrote a multi-threaded crawler that grabs the Thunder (Xunlei) download links from the Movie Paradise site. The code has been uploaded to GitHub; anyone who needs it can download it from there. I have only just started learning Python, so any advice is welcome.
Let's start with a brief introduction to how a web crawler works. A crawler needs a starting point, so we first carefully select some URLs as seeds. Starting from them, the crawler downloads and parses each page, extracts the information we need, and inserts any newly discovered URLs into a queue as starting points for the next round of crawling. This cycle repeats until the crawler has collected all the information we want. The picture below illustrates the process.
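As a side note, a minimal sketch of that loop might look like the following (the function and variable names here are illustrative and not taken from the article's code):

from collections import deque

def crawl(seed_urls, fetch, parse):
    # fetch(url) returns the page source; parse(page) returns
    # (extracted_data, new_urls) -- both are supplied by the caller
    queue = deque(seed_urls)       # URLs waiting to be crawled
    visited = set(seed_urls)       # URLs that have already been queued
    results = []
    while queue:
        url = queue.popleft()
        data, new_urls = parse(fetch(url))
        results.append(data)
        for u in new_urls:
            if u not in visited:   # only enqueue unseen URLs
                visited.add(u)
                queue.append(u)
    return results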
OK, let's get to the point and walk through the implementation.
First, analyze the structure of the Movie Paradise homepage.
From the menu bar we can see the overall classification of the site's resources. That is exactly what we need: each category's address can serve as a starting point for the crawler.
① Parse the homepage and extract the category information
# -*- coding: utf-8 -*-
import os
import re
import threading
from lxml import etree

# Parse the homepage
def crawindexpage(starturl):
    print "Crawling the homepage"
    page = __getpage(starturl)
    if page == "error":
        return
    page = page.decode('gbk', 'ignore')
    tree = etree.HTML(page)
    Nodes = tree.xpath("//div[@id='menu']//a")
    print "Parsed", len(Nodes), "addresses from the homepage"
    for node in Nodes:
        CrawledURLs = []
        CrawledURLs.append(starturl)
        url = node.xpath("@href")[0]
        if re.match(r'/html/[A-Za-z0-9_/]+/index.html', url):
            if __isexit(host + url, CrawledURLs):
                pass
            else:
                try:
                    catalog = node.xpath("text()")[0].encode("utf-8")
                    newdir = "E:/movie resources/" + catalog
                    os.makedirs(newdir.decode("utf-8"))
                    print "Created category directory ------ " + newdir
                    # one thread per category
                    thread = MyThread(host + url, newdir, CrawledURLs)
                    thread.start()
                except:
                    pass
In this function, the source of the homepage is downloaded, the menu categories are parsed out with XPath, and a directory is created for each category. One thing to watch out for is the encoding; I was stuck on it for quite a while. Looking at the page source shows that the page is encoded in GB2312, so the text has to be decoded from GBK/GB2312 to Unicode before building the tree object for XPath; otherwise the DOM tree is built incorrectly and later parsing goes wrong.
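The listings in this post also call two helpers, __getpage and __isexit, that are not shown here. A plausible sketch of them, inferred from how they are used (this is my reading, not the code from the repository), could be:

import urllib2

def __getpage(url):
    # download the raw page source; return "error" on any failure,
    # which the callers check before parsing
    try:
        response = urllib2.urlopen(url, timeout=30)
        return response.read()
    except Exception:
        return "error"

def __isexit(url, CrawledURLs):
    # True if this URL has already been handled by the current thread
    return url in CrawledURLs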
② Parse the home page of each category
# Parse one category page
def crawlistpage(indexurl, filedir, CrawledURLs):
    print "Parsing the resources on a category page"
    print indexurl
    page = __getpage(indexurl)
    if page == "error":
        return
    CrawledURLs.append(indexurl)
    page = page.decode('gbk', 'ignore')
    tree = etree.HTML(page)
    Nodes = tree.xpath("//div[@class='co_content8']//a")
    for node in Nodes:
        url = node.xpath("@href")[0]
        if re.match(r'/', url):
            # a non-paging address, from which the movie resource page can be parsed
            if __isexit(host + url, CrawledURLs):
                pass
            else:
                # file names cannot contain the special symbols below
                filename = node.xpath("text()")[0].encode("utf-8") \
                    .replace("/", " ").replace("\\", " ").replace(":", " ") \
                    .replace("*", " ").replace("?", " ").replace("\"", " ") \
                    .replace("<", " ").replace(">", " ").replace("|", " ")
                crawlsourcepage(host + url, filedir, filename, CrawledURLs)
        else:
            # a paging address; build the real address from it and recurse
            print "Paging address, resolving the real address from it", url
            index = indexurl.rfind("/")
            baseurl = indexurl[0:index + 1]
            pageurl = baseurl + url
            if __isexit(pageurl, CrawledURLs):
                pass
            else:
                print "Paging address, resolving the real address from it", pageurl
                crawlistpage(pageurl, filedir, CrawledURLs)
Opening the first page of each category, you will find that they all share the same structure (click to open an example): first locate the nodes that contain the resource URLs, then extract the name and URL. There are two points to note in this part. First, the resources are eventually saved into a .txt file, and a file name must not contain certain special symbols, so those have to be stripped out. Second, pagination has to be handled: the site spreads its data across multiple pages, so recognizing and crawling the paging links matters. Observing that the paging addresses do not start with "/", a regular expression is enough to pick out the paging links, which are then handled by a nested (recursive) call.
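As a side note on the file-name cleanup, the nine chained replace() calls could also be collapsed into a single regular expression. A small sketch (safe_filename is my own illustrative name, not a function from the crawler):

import re

def safe_filename(name):
    # Windows forbids / \ : * ? " < > | in file names,
    # so replace each of them with a space
    return re.sub(r'[/\\:*?"<>|]', " ", name)

print safe_filename('Example Movie: BD/1080p "sample"')
# every forbidden character is replaced with a space

Either approach works; the regex just keeps the listing shorter.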
③ Parse the resource addresses and save them to a file
# Process a resource page and crawl the resource (download) addresses
def crawlsourcepage(url, filedir, filename, CrawledURLs):
    print url
    page = __getpage(url)
    if page == "error":
        return
    CrawledURLs.append(url)
    page = page.decode('gbk', 'ignore')
    tree = etree.HTML(page)
    Nodes = tree.xpath("//div[@align='left']//table//a")
    try:
        source = filedir + "/" + filename + ".txt"
        f = open(source.decode("utf-8"), 'w')
        for node in Nodes:
            sourceurl = node.xpath("text()")[0]
            f.write(sourceurl.encode("utf-8") + "\n")
        f.close()
    except:
        print "!!!!!!!!!!!!!!!!!"
This part is fairly simple: just write the extracted content into a file.
To improve efficiency, the crawler is multi-threaded: a separate thread is started for each category homepage, which speeds things up considerably. At first I ran it single-threaded; after a whole afternoon of running, one unhandled exception meant the whole afternoon was wasted. Exhausting.
class MyThread(threading.Thread):  # inherits from threading.Thread
    def __init__(self, url, newdir, CrawledURLs):
        threading.Thread.__init__(self)
        self.url = url
        self.newdir = newdir
        self.CrawledURLs = CrawledURLs

    def run(self):
        # put the code the thread should execute in run(); it is invoked
        # after the thread is created and start() is called
        crawlistpage(self.url, self.newdir, self.CrawledURLs)
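For completeness, here is a sketch of how an entry point might tie these pieces together. The real one is in the GitHub repository, and the host value below is an assumption on my part rather than something taken from the article:

host = "http://www.dytt8.net"        # assumed site root, prepended to the relative hrefs

if __name__ == '__main__':
    crawindexpage(host)              # spawns one MyThread per category
    # wait for every category thread to finish before exiting
    for t in threading.enumerate():
        if t is not threading.current_thread():
            t.join()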
The above is only part of the code; the complete code can be downloaded from GitHub (click to jump).
The final crawl result looks like this.
Python multi-threaded crawler crawling Movie Paradise resources