I recently took some time to learn Python and wrote a multi-threaded crawler that grabs the Thunder (Xunlei) download links from the Movie Paradise site. The code has been uploaded to GitHub; anyone who needs it can download it from there. I have only just started learning Python, so any advice is welcome.
Let's start with a brief introduction to how a web crawler works. A crawler needs a starting point, so we first carefully select some URLs as seeds. Starting from them, the crawler downloads and parses each page, extracts the information we need, and inserts any newly discovered URLs into a queue as starting points for the next round of crawling. This cycle repeats until the crawler has collected all the information we want. The picture below illustrates the process.
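As a side note, a minimal sketch of that loop might look like the following (the function and variable names here are illustrative and not taken from the article's code):

from collections import deque

def crawl(seed_urls, fetch, parse):
    # fetch(url) returns the page source; parse(page) returns
    # (extracted_data, new_urls) -- both are supplied by the caller
    queue = deque(seed_urls)       # URLs waiting to be crawled
    visited = set(seed_urls)       # URLs that have already been queued
    results = []
    while queue:
        url = queue.popleft()
        data, new_urls = parse(fetch(url))
        results.append(data)
        for u in new_urls:
            if u not in visited:   # only enqueue unseen URLs
                visited.add(u)
                queue.append(u)
    return results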
OK, let's get to the point and walk through the implementation.
First, analyze the structure of the Movie Paradise homepage.
From the menu bar we can see the overall classification of the site's resources. That is exactly what we need: each category's address can serve as a starting point for the crawler.
① Parse the homepage and extract the category information
# -*- coding: utf-8 -*-
import os
import re
import threading
from lxml import etree

# Parse the homepage
def crawindexpage(starturl):
    print "Crawling the homepage"
    page = __getpage(starturl)
    if page == "error":
        return
    page = page.decode('gbk', 'ignore')
    tree = etree.HTML(page)
    Nodes = tree.xpath("//div[@id='menu']//a")
    print "Parsed", len(Nodes), "addresses from the homepage"
    for node in Nodes:
        CrawledURLs = []
        CrawledURLs.append(starturl)
        url = node.xpath("@href")[0]
        if re.match(r'/html/[A-Za-z0-9_/]+/index.html', url):
            if __isexit(host + url, CrawledURLs):
                pass
            else:
                try:
                    catalog = node.xpath("text()")[0].encode("utf-8")
                    newdir = "E:/movie resources/" + catalog
                    os.makedirs(newdir.decode("utf-8"))
                    print "Created category directory ------ " + newdir
                    # one thread per category
                    thread = MyThread(host + url, newdir, CrawledURLs)
                    thread.start()
                except:
                    pass
In this function, the source of the homepage is downloaded, the menu categories are parsed out with XPath, and a directory is created for each category. One thing to watch out for is the encoding; I was stuck on it for quite a while. Looking at the page source shows that the page is encoded in GB2312, so the text has to be decoded from GBK/GB2312 to Unicode before building the tree object for XPath; otherwise the DOM tree is built incorrectly and later parsing goes wrong.
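The listings in this post also call two helpers, __getpage and __isexit, that are not shown here. A plausible sketch of them, inferred from how they are used (this is my reading, not the code from the repository), could be:

import urllib2

def __getpage(url):
    # download the raw page source; return "error" on any failure,
    # which the callers check before parsing
    try:
        response = urllib2.urlopen(url, timeout=30)
        return response.read()
    except Exception:
        return "error"

def __isexit(url, CrawledURLs):
    # True if this URL has already been handled by the current thread
    return url in CrawledURLs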
② Parse the home page of each category
# Parse one category page
def crawlistpage(indexurl, filedir, CrawledURLs):
    print "Parsing the resources on a category page"
    print indexurl
    page = __getpage(indexurl)
    if page == "error":
        return
    CrawledURLs.append(indexurl)
    page = page.decode('gbk', 'ignore')
    tree = etree.HTML(page)
    Nodes = tree.xpath("//div[@class='co_content8']//a")
    for node in Nodes:
        url = node.xpath("@href")[0]
        if re.match(r'/', url):
            # a non-paging address, from which the movie resource page can be parsed
            if __isexit(host + url, CrawledURLs):
                pass
            else:
                # file names cannot contain the special symbols below
                filename = node.xpath("text()")[0].encode("utf-8") \
                    .replace("/", " ").replace("\\", " ").replace(":", " ") \
                    .replace("*", " ").replace("?", " ").replace("\"", " ") \
                    .replace("<", " ").replace(">", " ").replace("|", " ")
                crawlsourcepage(host + url, filedir, filename, CrawledURLs)
        else:
            # a paging address; build the real address from it and recurse
            print "Paging address, resolving the real address from it", url
            index = indexurl.rfind("/")
            baseurl = indexurl[0:index + 1]
            pageurl = baseurl + url
            if __isexit(pageurl, CrawledURLs):
                pass
            else:
                print "Paging address, resolving the real address from it", pageurl
                crawlistpage(pageurl, filedir, CrawledURLs)
Opening the first page of each category, you will find that they all share the same structure (click to open an example): first locate the nodes that contain the resource URLs, then extract the name and URL. There are two points to note in this part. First, the resources are eventually saved into a .txt file, and a file name must not contain certain special symbols, so those have to be stripped out. Second, pagination has to be handled: the site spreads its data across multiple pages, so recognizing and crawling the paging links matters. Observing that the paging addresses do not start with "/", a regular expression is enough to pick out the paging links, which are then handled by a nested (recursive) call.
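As a side note on the file-name cleanup, the nine chained replace() calls could also be collapsed into a single regular expression. A small sketch (safe_filename is my own illustrative name, not a function from the crawler):

import re

def safe_filename(name):
    # Windows forbids / \ : * ? " < > | in file names,
    # so replace each of them with a space
    return re.sub(r'[/\\:*?"<>|]', " ", name)

print safe_filename('Example Movie: BD/1080p "sample"')
# every forbidden character is replaced with a space

Either approach works; the regex just keeps the listing shorter.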
③ Parse the resource addresses and save them to a file
# Process a resource page and crawl the resource (download) addresses
def crawlsourcepage(url, filedir, filename, CrawledURLs):
    print url
    page = __getpage(url)
    if page == "error":
        return
    CrawledURLs.append(url)
    page = page.decode('gbk', 'ignore')
    tree = etree.HTML(page)
    Nodes = tree.xpath("//div[@align='left']//table//a")
    try:
        source = filedir + "/" + filename + ".txt"
        f = open(source.decode("utf-8"), 'w')
        for node in Nodes:
            sourceurl = node.xpath("text()")[0]
            f.write(sourceurl.encode("utf-8") + "\n")
        f.close()
    except:
        print "!!!!!!!!!!!!!!!!!"
This part is fairly simple: just write the extracted content into a file.
To improve efficiency, the crawler is multi-threaded: a separate thread is started for each category homepage, which speeds things up considerably. At first I ran it single-threaded; after a whole afternoon of running, one unhandled exception meant the whole afternoon was wasted. Exhausting.
class MyThread(threading.Thread):  # inherits from threading.Thread
    def __init__(self, url, newdir, CrawledURLs):
        threading.Thread.__init__(self)
        self.url = url
        self.newdir = newdir
        self.CrawledURLs = CrawledURLs

    def run(self):
        # put the code the thread should execute in run(); it is invoked
        # after the thread is created and start() is called
        crawlistpage(self.url, self.newdir, self.CrawledURLs)
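For completeness, here is a sketch of how an entry point might tie these pieces together. The real one is in the GitHub repository, and the host value below is an assumption on my part rather than something taken from the article:

host = "http://www.dytt8.net"        # assumed site root, prepended to the relative hrefs

if __name__ == '__main__':
    crawindexpage(host)              # spawns one MyThread per category
    # wait for every category thread to finish before exiting
    for t in threading.enumerate():
        if t is not threading.current_thread():
            t.join()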
The above is only part of the code; the complete code can be downloaded from GitHub (click to jump).
The final crawl result looks like this.
Python multi-threaded crawler crawling Movie Paradise resources