Python multi-threaded crawler for crawling Movie Heaven resources

Source: Internet
Author: User
This article mainly describes how to use a Python multi-threaded crawler to crawl Movie Heaven resources; refer to it if you need it. After spending some time learning Python, I wrote a multi-threaded crawler program to grab the Thunder (Xunlei) download links for Movie Heaven resources. The code has been uploaded to GitHub and can be downloaded by anyone who needs it. As this is a first attempt, I hope to receive some valuable comments.

Let's briefly introduce the basic working principle of a web crawler. A crawler first needs a starting point, so we carefully select some URLs as seeds. The crawler then starts from these seeds, downloads and parses each captured page, extracts the required information, and inserts any newly discovered URLs into a queue to serve as starting points for the next round of crawling. This repeats until all the information you want has been fetched. The diagram below illustrates the process.
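As a minimal illustration of that loop, here is a generic breadth-first crawl sketch. It is not part of the Movie Heaven crawler itself; the seed handling, link pattern and page processing are placeholders.

# -*- coding: utf-8 -*-
# Generic breadth-first crawl loop (illustrative sketch only, not the article's code).
import re
import urllib2
from collections import deque

def crawl(seeds, max_pages=50):
    queue = deque(seeds)   # URLs waiting to be fetched
    visited = set()        # URLs already fetched, to avoid repeats
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        try:
            page = urllib2.urlopen(url, timeout=10).read()
        except Exception:
            continue       # skip pages that fail to download
        visited.add(url)
        # ... extract the information you need from `page` here ...
        # push newly discovered links back onto the queue
        for link in re.findall(r'href="(http[^"]+)"', page):
            if link not in visited:
                queue.append(link)
    return visited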

  ① Parse the homepage

# -*- coding: utf-8 -*-
import os
import re
import threading
from lxml import etree

# Note: `host`, `_getpage` and `_isexit` are defined elsewhere in the full program.

# Parse the homepage
def CrawIndexPage(starturl):
    print "crawling the homepage"
    page = _getpage(starturl)
    if page == "error":
        return
    page = page.decode('gbk', 'ignore')
    tree = etree.HTML(page)
    Nodes = tree.xpath("//p[@id='menu']//a")
    print "homepage parsed,", len(Nodes), "entries"
    for node in Nodes:
        CrawledURLs = []
        CrawledURLs.append(starturl)
        url = node.xpath("@href")[0]
        if re.match(r'/html/[A-Za-z0-9_/]+/index.html', url):
            if _isexit(host + url, CrawledURLs):
                pass
            else:
                try:
                    catalog = node.xpath("text()")[0].encode("utf-8")
                    newdir = "E:/movie resource/" + catalog
                    os.makedirs(newdir.decode("utf-8"))
                    print "classification directory created successfully ------" + newdir
                    thread = myThread(host + url, newdir, CrawledURLs)
                    thread.start()
                except:
                    pass

In this function, we first download the source code of the page, parse the menu category information with XPath, and create a file directory for each category. One thing to pay attention to is the encoding problem, which kept me entangled for quite a while. Looking at the page source, you can see that the site uses GB2312 encoding, so before building the tree with etree.HTML the text must be decoded, converting GB2312/GBK into Unicode. Only then is the DOM tree structure correct; otherwise problems appear during later parsing.
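The article never shows the _getpage helper used above. A plausible sketch, assuming it simply downloads the page and returns the raw bytes, or the string "error" on failure (which matches how it is called), might look like this; the User-Agent header is my own addition.

# -*- coding: utf-8 -*-
# Assumed sketch of the _getpage helper (not shown in the article):
# return the raw page bytes, or "error" if the download fails.
import urllib2

def _getpage(url):
    try:
        request = urllib2.Request(url)
        request.add_header('User-Agent', 'Mozilla/5.0')  # some servers reject the default agent
        return urllib2.urlopen(request, timeout=10).read()
    except Exception:
        return "error"

The caller then decodes the GBK bytes with page.decode('gbk', 'ignore') before handing the Unicode text to etree.HTML, as described above.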

  ② Parse the homepage of each category

# Parse a classification page
def CrawListPage(indexurl, filedir, CrawledURLs):
    print "parsing classification home page resources"
    print indexurl
    page = _getpage(indexurl)
    if page == "error":
        return
    CrawledURLs.append(indexurl)
    page = page.decode('gbk', 'ignore')
    tree = etree.HTML(page)
    Nodes = tree.xpath("//p[@class='co_content8']//a")
    for node in Nodes:
        url = node.xpath("@href")[0]
        if re.match(r'/', url):
            # A non-pagination address; the video resource page can be parsed from it
            if _isexit(host + url, CrawledURLs):
                pass
            else:
                # The file name must not contain the following special symbols
                filename = node.xpath("text()")[0].encode("utf-8") \
                    .replace("/", "") \
                    .replace("\\", "") \
                    .replace(":", "") \
                    .replace("*", "") \
                    .replace("?", "") \
                    .replace("\"", "") \
                    .replace("<", "") \
                    .replace(">", "") \
                    .replace("|", "")
                CrawlSourcePage(host + url, filedir, filename, CrawledURLs)
        else:
            # A pagination address: re-parse it with a nested (recursive) call
            print "pagination address, nested re-parsing", url
            index = indexurl.rfind("/")
            baseurl = indexurl[0:index + 1]
            pageurl = baseurl + url
            if _isexit(pageurl, CrawledURLs):
                pass
            else:
                print "re-parsing nested page url", pageurl
                CrawListPage(pageurl, filedir, CrawledURLs)

When you open the homepage of each category, you will find that they all share the same structure (click the link to open an example). First, the nodes containing resource URLs are parsed and the name and URL are extracted. Two points deserve attention here. First, the resources are eventually saved to a .txt file, so certain special characters must not appear in the file name. Second, pagination must be handled: the data on the site is displayed page by page, so it is important to recognize and follow the paging links. It can be observed that paging addresses do not start with "/", so they only need to be matched with a regular expression and then handled by a nested (recursive) call to CrawListPage, which solves the paging problem. A compact alternative to the filename cleaning is sketched below.
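The chained replace() calls above can also be written as a single regular-expression substitution. This is an equivalent rewrite for illustration, not the article's original code; the helper name clean_filename is hypothetical.

# -*- coding: utf-8 -*-
# Equivalent filename sanitising with one regex instead of chained replace() calls.
import re

def clean_filename(name):
    # remove the characters Windows does not allow in file names
    return re.sub(ur'[/\\:*?"<>|]', u'', name)

# e.g. clean_filename(u'Movie: Part 1/2') returns u'Movie Part 12'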

③ Resolve the resource address and save it to the file.

# Process a resource page and crawl the resource (download) address
def CrawlSourcePage(url, filedir, filename, CrawledURLs):
    print url
    page = _getpage(url)
    if page == "error":
        return
    CrawledURLs.append(url)
    page = page.decode('gbk', 'ignore')
    tree = etree.HTML(page)
    Nodes = tree.xpath("//p[@align='left']//table//a")
    try:
        source = filedir + "/" + filename + ".txt"
        f = open(source.decode("utf-8"), 'w')
        for node in Nodes:
            sourceurl = node.xpath("text()")[0]
            f.write(sourceurl.encode("utf-8") + "\n")
        f.close()
    except:
        print "!!!!!!!!!!!!!!!!!"

This part is relatively simple: just write the extracted download links to a file.

To improve efficiency, the program uses multithreading: one thread is opened for each category homepage, which greatly speeds up crawling. At first I ran it single-threaded and waited a whole afternoon, only to get nothing because an unhandled exception aborted the run. The whole afternoon was wasted! Exhausting.

class myThread(threading.Thread):  # inherit from the parent class threading.Thread
    def __init__(self, url, newdir, CrawledURLs):
        threading.Thread.__init__(self)
        self.url = url
        self.newdir = newdir
        self.CrawledURLs = CrawledURLs

    def run(self):
        # put the code to be executed in the run function;
        # the thread calls run() directly after it is started
        CrawListPage(self.url, self.newdir, self.CrawledURLs)
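To avoid losing an entire run to one unhandled exception, as described above, the run() method can be wrapped in a try/except. This is a hedged variation of the article's thread class, shown here under the hypothetical name SafeThread to keep it distinct:

# -*- coding: utf-8 -*-
# Variation of myThread with basic exception handling, so one failing category
# does not silently kill its thread and waste the whole run.
import threading

class SafeThread(threading.Thread):
    def __init__(self, url, newdir, CrawledURLs):
        threading.Thread.__init__(self)
        self.url = url
        self.newdir = newdir
        self.CrawledURLs = CrawledURLs

    def run(self):
        try:
            CrawListPage(self.url, self.newdir, self.CrawledURLs)
        except Exception as e:
            # report the failure instead of letting the thread die silently
            print "thread for", self.url, "failed:", e

If the main program should also wait for every category to finish, the started threads can be collected in a list and join()ed before exiting.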

The above is only part of the code; the complete code can be downloaded from GitHub (click here to jump).

The final crawling result is as follows.

The above describes how to use a Python multi-threaded crawler to crawl Movie Heaven resources. I hope it helps you. If you have any questions, please leave a message and the editor will reply in a timely manner. Thank you for your support of the PHP Chinese network!

