Use a Python Multi-threaded Crawler to Crawl Movie Heaven Resources

I recently spent some time learning Python and wrote a multi-threaded crawler to grab the Thunder (Xunlei) download links for Movie Heaven resources. The code has been uploaded to GitHub, so you can download what you need yourself. Since I am just getting started, I would appreciate any valuable comments.

Let me briefly introduce the basic working principle of a web crawler. A crawler needs a starting point, so we carefully select some URLs as seeds. The crawler starts from these seeds, fetches and parses each page, extracts the required information, and pushes any newly discovered URLs onto a queue to serve as starting points for the next round of crawling. This repeats until all of the information you want has been collected.
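To make that loop concrete, here is a minimal, generic sketch of the crawl loop in Python. It is not this project's code; fetch_page and extract_links are hypothetical callables standing in for the downloading and parsing steps described above.

from collections import deque

def crawl(seed_urls, fetch_page, extract_links, max_pages=1000):
    # Queue of URLs still to visit, seeded with the chosen starting points
    queue = deque(seed_urls)
    visited = set(seed_urls)
    results = []
    while queue and len(results) < max_pages:
        url = queue.popleft()
        page = fetch_page(url)            # download the page (placeholder)
        if page is None:
            continue
        results.append((url, page))       # "extract the required information"
        for link in extract_links(page):  # queue newly discovered URLs
            if link not in visited:
                visited.add(link)
                queue.append(link)
    return results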

Okay. Next, let's get started with the program implementation.

First, analyze the homepage structure of the Movie Heaven website.

From the top menu bar we can see the overall categories of the site's resources, so we can take each category's address as a starting point for the crawler.

  ① Parse the homepage address and extract the category information

# -*- coding: utf-8 -*-
import os
import re
import threading
from lxml import etree

# Parse the homepage and extract the category links.
# (__getpage, __isexit and host are helpers defined elsewhere in the full source.)
def CrawIndexPage(starturl):
    print "Crawling the homepage"
    page = __getpage(starturl)
    if page == "error":
        return
    page = page.decode('gbk', 'ignore')
    tree = etree.HTML(page)
    Nodes = tree.xpath("//div[@id='menu']//a")
    print "The homepage resolved", len(Nodes), "category addresses"
    for node in Nodes:
        CrawledURLs = []
        CrawledURLs.append(starturl)
        url = node.xpath("@href")[0]
        if re.match(r'/html/[A-Za-z0-9_/]+/index.html', url):
            if __isexit(host + url, CrawledURLs):
                pass
            else:
                try:
                    catalog = node.xpath("text()")[0].encode("utf-8")
                    newdir = "E:/movie resource/" + catalog
                    os.makedirs(newdir.decode("utf-8"))
                    print "Classification directory created successfully ------" + newdir
                    thread = myThread(host + url, newdir, CrawledURLs)
                    thread.start()
                except:
                    pass

In this function, we first download the page source, parse out the menu category information with XPath, and create a directory for each category. One thing to watch out for is the encoding, which kept me stuck for quite a while. Looking at the page source, you can see the site uses GB2312 encoding, so before building the tree with etree.HTML the text must be decoded from GB2312 into Unicode. Only then is the DOM tree structure correct; otherwise problems appear during later parsing.
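The helpers __getpage and __isexit used above (along with the host variable) are not part of this excerpt; they are in the full source on GitHub. A minimal sketch of what they might look like, assuming urllib2 is used for downloading and CrawledURLs is a plain list, would be:

import urllib2

def __getpage(url):
    # Download the raw (GB2312-encoded) HTML; return the sentinel "error" on failure,
    # which is what the calling code checks for.
    try:
        req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        return urllib2.urlopen(req, timeout=10).read()
    except Exception:
        return "error"

def __isexit(url, CrawledURLs):
    # True if this URL has already been crawled, so the caller can skip it
    return url in CrawledURLs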

  ② Parse the homepage of each category

# Parse a category list page
def CrawListPage(indexurl, filedir, CrawledURLs):
    print "Parsing the category homepage resources"
    print indexurl
    page = __getpage(indexurl)
    if page == "error":
        return
    CrawledURLs.append(indexurl)
    page = page.decode('gbk', 'ignore')
    tree = etree.HTML(page)
    Nodes = tree.xpath("//div[@class='co_content8']//a")
    for node in Nodes:
        url = node.xpath("@href")[0]
        if re.match(r'/', url):
            # Non-paging address: the video resource page can be parsed from it
            if __isexit(host + url, CrawledURLs):
                pass
            else:
                # The file name cannot contain the following special symbols
                filename = node.xpath("text()")[0].encode("utf-8") \
                    .replace("/", "") \
                    .replace("\\", "") \
                    .replace(":", "") \
                    .replace("*", "") \
                    .replace("?", "") \
                    .replace("\"", "") \
                    .replace("<", "") \
                    .replace(">", "") \
                    .replace("|", "")
                CrawlSourcePage(host + url, filedir, filename, CrawledURLs)
        else:
            # Paging address: parse it recursively
            print "Paging address, nesting to re-parse", url
            index = indexurl.rfind("/")
            baseurl = indexurl[0:index + 1]
            pageurl = baseurl + url
            if __isexit(pageurl, CrawledURLs):
                pass
            else:
                print "Re-parsing nested paging address", pageurl
                CrawListPage(pageurl, filedir, CrawledURLs)

When you open the homepage of each category, you will find that they all share the same structure (open any category page to see an example). First, the nodes containing resource URLs are parsed and the name and URL are extracted. Two points need attention here. First, the resources are eventually saved to a .txt file, so certain special characters must not appear in the file name. Second, pagination must be handled: the data on the site is displayed page by page, so recognizing and crawling the paging links matters. Observation shows that paging addresses do not start with "/", so a regular expression is enough to pick them out, and the paging problem is solved by calling CrawListPage recursively.
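As a side note, the long chain of replace calls used above to strip forbidden characters from the file name can be collapsed into a single regular expression. This is only an equivalent alternative, not the author's code:

import re

def sanitize_filename(name):
    # Remove the characters Windows forbids in file names: / \ : * ? " < > |
    return re.sub(r'[/\\:*?"<>|]', "", name)

# Example: sanitize_filename('Movie: Part 1/2?') returns 'Movie Part 12'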

③ Parse the resource address and save it to a file

# Process the resource page and save the crawled resource (download) addresses
def CrawlSourcePage(url, filedir, filename, CrawledURLs):
    print url
    page = __getpage(url)
    if page == "error":
        return
    CrawledURLs.append(url)
    page = page.decode('gbk', 'ignore')
    tree = etree.HTML(page)
    Nodes = tree.xpath("//div[@align='left']//table//a")
    try:
        source = filedir + "/" + filename + ".txt"
        f = open(source.decode("utf-8"), 'w')
        for node in Nodes:
            sourceurl = node.xpath("text()")[0]
            f.write(sourceurl.encode("utf-8") + "\n")
        f.close()
    except:
        print "!!!!!!!!!!!!!!!!!"

This part is relatively simple: just write the extracted download links to a file.

To improve running efficiency, the crawl is multi-threaded: one thread is started for each category homepage, which speeds things up considerably. I originally ran it single-threaded and waited a whole afternoon, only for the run to die on an unhandled exception; the whole afternoon was wasted. Exhausting.

class myThread(threading.Thread):  # inherit from the parent class threading.Thread
    def __init__(self, url, newdir, CrawledURLs):
        threading.Thread.__init__(self)
        self.url = url
        self.newdir = newdir
        self.CrawledURLs = CrawledURLs

    def run(self):
        # Put the code to execute in run(); it runs once the thread is started
        CrawListPage(self.url, self.newdir, self.CrawledURLs)
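For completeness, here is a hedged sketch of how the pieces above might be wired together at the top level. The real entry point is in the GitHub repository; the host value below is a placeholder, not the site's actual address, and the functions come from the snippets shown earlier.

host = "http://www.example-movie-site.com"  # placeholder for the Movie Heaven site root

if __name__ == '__main__':
    starturl = host + "/index.html"          # assumed homepage address
    CrawIndexPage(starturl)                  # starts one myThread per category
    # Wait for all category threads to finish before exiting
    for t in threading.enumerate():
        if t is not threading.current_thread():
            t.join()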

The above is only part of the code; the complete code can be downloaded from GitHub.

The final crawling result is a set of category directories, each containing one text file of download links per movie.

That covers how to use a Python multi-threaded crawler to crawl Movie Heaven resources. I hope it helps you. If you have any questions, please leave a message and I will reply as soon as I can.
