I recently spent some time learning Python and wrote a multithreaded crawler that collects the Thunder (Xunlei) download addresses of Movie Paradise resources. The code has been uploaded to GitHub; anyone who needs it can download it from there. I have only just started learning Python, so any advice is very welcome.
First, a brief introduction to the basic principle behind a web crawler. A crawler needs a starting point, so we carefully choose some URLs as seeds. The crawler starts from these seeds, downloads and parses each page, extracts the information we need, and pushes any newly discovered URLs into a queue to serve as the next crawl targets. This cycle repeats until everything we want has been collected. The picture below illustrates the process.
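As a rough illustration of that loop (this is just a sketch of the general idea, not the project code; parse_page here stands for whatever download-and-extract step the crawler performs), the skeleton looks something like this:

from collections import deque

def crawl(seed_urls, parse_page):
    # parse_page(url) is assumed to download the page, save whatever data we
    # need, and return any newly discovered URLs.
    queue = deque(seed_urls)   # URLs waiting to be crawled
    crawled = set()            # URLs already processed
    while queue:
        url = queue.popleft()
        if url in crawled:
            continue
        crawled.add(url)
        for new_url in parse_page(url):
            if new_url not in crawled:
                queue.append(new_url)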
Okay, let's dive into the program and go through how it is implemented.
First, we analyze the structure of the Movie Paradise homepage.
The menu bar at the top shows how the entire site's resources are categorized. Conveniently, we can use this classification and take each category's address as a starting point for the crawler.
① Parse the homepage address and extract the category information
import os
import re
from lxml import etree

# Parse the homepage.
# host (the site's root URL), __getpage() and __isexit() are helpers defined
# elsewhere in the full code.
def CrawIndexPage(starturl):
    print "Crawling the homepage"
    page = __getpage(starturl)
    if page == "error":
        return
    page = page.decode('gbk', 'ignore')
    tree = etree.HTML(page)
    Nodes = tree.xpath("//div[@id='menu']//a")
    print "The homepage yielded", len(Nodes), "category addresses"
    for node in Nodes:
        CrawledURLs = []
        CrawledURLs.append(starturl)
        url = node.xpath("@href")[0]
        if re.match(r'/html/[A-Za-z0-9_/]+/index.html', url):
            if __isexit(host + url, CrawledURLs):
                pass
            else:
                try:
                    catalog = node.xpath("text()")[0].encode("utf-8")
                    newdir = "E:/Movie resource/" + catalog
                    os.makedirs(newdir.decode("utf-8"))
                    print "Created directory successfully ------ " + newdir
                    thread = myThread(host + url, newdir, CrawledURLs)
                    thread.start()
                except:
                    pass
In this function, we first download the page source, parse out the menu categories via XPath, and create a corresponding file directory for each category. One thing to watch out for is encoding; I was stuck on it for quite a while. Looking at the page source shows that the site is encoded as GB2312, so the text has to be decoded from GB2312 into Unicode before the tree object is queried with XPath. Only then is the DOM tree structure correct; otherwise problems show up later during parsing.
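The __getpage helper used above is not shown in this excerpt. A minimal sketch of what such a helper might look like (my assumption, not the actual project code: it downloads the raw page bytes and returns the string "error" on failure, since that is what the callers check for):

import urllib2

def __getpage(url):
    # Download the raw (GBK-encoded) page source; the callers decode it themselves.
    try:
        response = urllib2.urlopen(url, timeout=10)
        return response.read()
    except Exception:
        return "error"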
② Parse the first page of each category
# Parse a category page
def CrawListPage(indexurl, filedir, CrawledURLs):
    print "Parsing the category index page"
    print indexurl
    page = __getpage(indexurl)
    if page == "error":
        return
    CrawledURLs.append(indexurl)
    page = page.decode('gbk', 'ignore')
    tree = etree.HTML(page)
    Nodes = tree.xpath("//div[@class='co_content8']//a")
    for node in Nodes:
        url = node.xpath("@href")[0]
        if re.match(r'/', url):
            # Not a paging address, so the movie resource page can be resolved from it
            if __isexit(host + url, CrawledURLs):
                pass
            else:
                # File names must not contain the following special characters
                filename = node.xpath("text()")[0].encode("utf-8") \
                    .replace("/", "") \
                    .replace("\\", "") \
                    .replace(":", "") \
                    .replace("*", "") \
                    .replace("?", "") \
                    .replace("\"", "") \
                    .replace("<", "") \
                    .replace(">", "") \
                    .replace("|", "")
                CrawlSourcePage(host + url, filedir, filename, CrawledURLs)
        else:
            # A paging address, recurse into it and parse it again
            print "Paging address, parsing it recursively", url
            index = indexurl.rfind("/")
            baseurl = indexurl[0:index + 1]
            pageurl = baseurl + url
            if __isexit(pageurl, CrawledURLs):
                pass
            else:
                print "Paging address, parsing it recursively", pageurl
                CrawListPage(pageurl, filedir, CrawledURLs)
Opening the first page of each category, you will find they all share a similar structure (click to open an example). We first locate the nodes that contain the resource URLs, then extract the name and URL from each. Two things need attention in this section. First, since the resources will eventually be saved to a .txt file, the file name must not contain certain special characters, so those have to be stripped out. Second, pagination has to be handled: the site spreads its data across multiple pages, so recognizing and crawling the paging links matters. By observation, paging addresses do not start with "/", so a regular expression is enough to pick out the paging links, and pagination is then handled by calling the parser recursively on them.
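For illustration, those two points can also be written compactly as below (just a sketch, not the project code; the chained replace() calls and the re.match(r'/', url) check in the function above do the equivalent work):

import re

def sanitize_filename(name):
    # Strip the characters Windows does not allow in file names.
    return re.sub(r'[\\/:*?"<>|]', '', name)

def is_paging_link(url):
    # On this site, resource detail links start with "/" while paging links
    # are relative addresses that do not.
    return not url.startswith('/')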
③ Parse the resource addresses and save them to a file
# Process a resource page and crawl the download addresses
def CrawlSourcePage(url, filedir, filename, CrawledURLs):
    print url
    page = __getpage(url)
    if page == "error":
        return
    CrawledURLs.append(url)
    page = page.decode('gbk', 'ignore')
    tree = etree.HTML(page)
    Nodes = tree.xpath("//div[@align='left']//table//a")
    try:
        source = filedir + "/" + filename + ".txt"
        f = open(source.decode("utf-8"), 'w')
        for node in Nodes:
            sourceurl = node.xpath("text()")[0]
            f.write(sourceurl.encode("utf-8") + "\n")
        f.close()
    except:
        pass
This part is fairly simple: the extracted content just needs to be written to a file.
To improve the program's efficiency, the crawl is multithreaded: here I start a separate thread for the homepage of each category, which speeds up the crawler considerably. Originally I ran it single-threaded, waited a whole afternoon, and in the end an unhandled exception threw the entire afternoon's run away. Exhausting.
import threading

class myThread(threading.Thread):   # inherit from threading.Thread
    def __init__(self, url, newdir, CrawledURLs):
        threading.Thread.__init__(self)
        self.url = url
        self.newdir = newdir
        self.CrawledURLs = CrawledURLs

    def run(self):
        # Put the work to be done in run(); the thread executes run() once
        # start() is called. Each thread parses one category page.
        CrawListPage(self.url, self.newdir, self.CrawledURLs)
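The __isexit helper that appears throughout the code is not included in this excerpt either; judging from how it is called, it simply reports whether a URL has already been crawled (the callers append new URLs to the list themselves). A hypothetical sketch under that assumption, not the actual project code:

def __isexit(newurl, CrawledURLs):
    # Return True if this URL has already been crawled, otherwise False.
    return newurl in CrawledURLs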
The above is only part of the code; the complete code can be downloaded from GitHub (click here to jump).
The results of the final crawl are as follows.
That is my introduction to crawling Movie Paradise resources with a multithreaded Python crawler. I hope it helps; if you have any questions, leave me a message and I will reply as soon as I can.