Python multi-threaded crawler for crawling Movie Heaven resources

Source: Internet
Author: User
This article mainly describes how to use a Python multi-threaded crawler to crawl Movie Heaven resources; refer to it if you need it. After spending some time learning Python, I wrote a multi-threaded crawler program to grab the Thunder (Xunlei) download links for Movie Heaven resources. The code has been uploaded to GitHub and can be downloaded by anyone who needs it. As this is a first attempt, I hope to receive some valuable comments.

Let's briefly introduce the basic working principle of a web crawler. A crawler first needs a starting point, so we carefully select some URLs as seeds. The crawler then starts from these seeds, downloads and parses each captured page, extracts the required information, and inserts any newly discovered URLs into a queue to serve as starting points for the next round of crawling. This repeats until all the information you want has been fetched. The diagram below illustrates the process.
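As a minimal illustration of that loop, here is a generic breadth-first crawl sketch. It is not part of the Movie Heaven crawler itself; the seed handling, link pattern and page processing are placeholders.

# -*- coding: utf-8 -*-
# Generic breadth-first crawl loop (illustrative sketch only, not the article's code).
import re
import urllib2
from collections import deque

def crawl(seeds, max_pages=50):
    queue = deque(seeds)   # URLs waiting to be fetched
    visited = set()        # URLs already fetched, to avoid repeats
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        try:
            page = urllib2.urlopen(url, timeout=10).read()
        except Exception:
            continue       # skip pages that fail to download
        visited.add(url)
        # ... extract the information you need from `page` here ...
        # push newly discovered links back onto the queue
        for link in re.findall(r'href="(http[^"]+)"', page):
            if link not in visited:
                queue.append(link)
    return visited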

  ① Parse the homepage

# -*- coding: utf-8 -*-
import os
import re
import threading
from lxml import etree

# Note: `host`, `_getpage` and `_isexit` are defined elsewhere in the full program.

# Parse the homepage
def CrawIndexPage(starturl):
    print "crawling the homepage"
    page = _getpage(starturl)
    if page == "error":
        return
    page = page.decode('gbk', 'ignore')
    tree = etree.HTML(page)
    Nodes = tree.xpath("//p[@id='menu']//a")
    print "homepage parsed,", len(Nodes), "entries"
    for node in Nodes:
        CrawledURLs = []
        CrawledURLs.append(starturl)
        url = node.xpath("@href")[0]
        if re.match(r'/html/[A-Za-z0-9_/]+/index.html', url):
            if _isexit(host + url, CrawledURLs):
                pass
            else:
                try:
                    catalog = node.xpath("text()")[0].encode("utf-8")
                    newdir = "E:/movie resource/" + catalog
                    os.makedirs(newdir.decode("utf-8"))
                    print "classification directory created successfully ------" + newdir
                    thread = myThread(host + url, newdir, CrawledURLs)
                    thread.start()
                except:
                    pass

In this function, we first download the source code of the page, parse the menu category information with XPath, and create a file directory for each category. One thing to pay attention to is the encoding problem, which kept me entangled for quite a while. Looking at the page source, you can see that the site uses GB2312 encoding, so before building the tree with etree.HTML the text must be decoded, converting GB2312/GBK into Unicode. Only then is the DOM tree structure correct; otherwise problems appear during later parsing.
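The article never shows the _getpage helper used above. A plausible sketch, assuming it simply downloads the page and returns the raw bytes, or the string "error" on failure (which matches how it is called), might look like this; the User-Agent header is my own addition.

# -*- coding: utf-8 -*-
# Assumed sketch of the _getpage helper (not shown in the article):
# return the raw page bytes, or "error" if the download fails.
import urllib2

def _getpage(url):
    try:
        request = urllib2.Request(url)
        request.add_header('User-Agent', 'Mozilla/5.0')  # some servers reject the default agent
        return urllib2.urlopen(request, timeout=10).read()
    except Exception:
        return "error"

The caller then decodes the GBK bytes with page.decode('gbk', 'ignore') before handing the Unicode text to etree.HTML, as described above.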

  ② Parse the homepage of each category

# Parse a classification page
def CrawListPage(indexurl, filedir, CrawledURLs):
    print "parsing classification home page resources"
    print indexurl
    page = _getpage(indexurl)
    if page == "error":
        return
    CrawledURLs.append(indexurl)
    page = page.decode('gbk', 'ignore')
    tree = etree.HTML(page)
    Nodes = tree.xpath("//p[@class='co_content8']//a")
    for node in Nodes:
        url = node.xpath("@href")[0]
        if re.match(r'/', url):
            # A non-pagination address; the video resource page can be parsed from it
            if _isexit(host + url, CrawledURLs):
                pass
            else:
                # The file name must not contain the following special symbols
                filename = node.xpath("text()")[0].encode("utf-8") \
                    .replace("/", "") \
                    .replace("\\", "") \
                    .replace(":", "") \
                    .replace("*", "") \
                    .replace("?", "") \
                    .replace("\"", "") \
                    .replace("<", "") \
                    .replace(">", "") \
                    .replace("|", "")
                CrawlSourcePage(host + url, filedir, filename, CrawledURLs)
        else:
            # A pagination address: re-parse it with a nested (recursive) call
            print "pagination address, nested re-parsing", url
            index = indexurl.rfind("/")
            baseurl = indexurl[0:index + 1]
            pageurl = baseurl + url
            if _isexit(pageurl, CrawledURLs):
                pass
            else:
                print "re-parsing nested page url", pageurl
                CrawListPage(pageurl, filedir, CrawledURLs)

When you open the homepage of each category, you will find that they all share the same structure (click the link to open an example). First, the nodes containing resource URLs are parsed and the name and URL are extracted. Two points deserve attention here. First, the resources are eventually saved to a .txt file, so certain special characters must not appear in the file name. Second, pagination must be handled: the data on the site is displayed page by page, so it is important to recognize and follow the paging links. It can be observed that paging addresses do not start with "/", so they only need to be matched with a regular expression and then handled by a nested (recursive) call to CrawListPage, which solves the paging problem. A compact alternative to the filename cleaning is sketched below.
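The chained replace() calls above can also be written as a single regular-expression substitution. This is an equivalent rewrite for illustration, not the article's original code; the helper name clean_filename is hypothetical.

# -*- coding: utf-8 -*-
# Equivalent filename sanitising with one regex instead of chained replace() calls.
import re

def clean_filename(name):
    # remove the characters Windows does not allow in file names
    return re.sub(ur'[/\\:*?"<>|]', u'', name)

# e.g. clean_filename(u'Movie: Part 1/2') returns u'Movie Part 12'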

③ Resolve the resource address and save it to the file.

# Process a resource page and crawl the resource (download) address
def CrawlSourcePage(url, filedir, filename, CrawledURLs):
    print url
    page = _getpage(url)
    if page == "error":
        return
    CrawledURLs.append(url)
    page = page.decode('gbk', 'ignore')
    tree = etree.HTML(page)
    Nodes = tree.xpath("//p[@align='left']//table//a")
    try:
        source = filedir + "/" + filename + ".txt"
        f = open(source.decode("utf-8"), 'w')
        for node in Nodes:
            sourceurl = node.xpath("text()")[0]
            f.write(sourceurl.encode("utf-8") + "\n")
        f.close()
    except:
        print "!!!!!!!!!!!!!!!!!"

This part is relatively simple: just write the extracted download links to a file.

To improve efficiency, the program uses multithreading: one thread is opened for each category homepage, which greatly speeds up crawling. At first I ran it single-threaded and waited a whole afternoon, only to get nothing because an unhandled exception aborted the run. The whole afternoon was wasted! Exhausting.

class myThread(threading.Thread):  # inherit from the parent class threading.Thread
    def __init__(self, url, newdir, CrawledURLs):
        threading.Thread.__init__(self)
        self.url = url
        self.newdir = newdir
        self.CrawledURLs = CrawledURLs

    def run(self):
        # put the code to be executed in the run function;
        # the thread calls run() directly after it is started
        CrawListPage(self.url, self.newdir, self.CrawledURLs)
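To avoid losing an entire run to one unhandled exception, as described above, the run() method can be wrapped in a try/except. This is a hedged variation of the article's thread class, shown here under the hypothetical name SafeThread to keep it distinct:

# -*- coding: utf-8 -*-
# Variation of myThread with basic exception handling, so one failing category
# does not silently kill its thread and waste the whole run.
import threading

class SafeThread(threading.Thread):
    def __init__(self, url, newdir, CrawledURLs):
        threading.Thread.__init__(self)
        self.url = url
        self.newdir = newdir
        self.CrawledURLs = CrawledURLs

    def run(self):
        try:
            CrawListPage(self.url, self.newdir, self.CrawledURLs)
        except Exception as e:
            # report the failure instead of letting the thread die silently
            print "thread for", self.url, "failed:", e

If the main program should also wait for every category to finish, the started threads can be collected in a list and join()ed before exiting.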

The above is only part of the code; the complete code can be downloaded from GitHub (click here to jump).

The final crawling result is as follows.

The above describes how to use a Python multi-threaded crawler to crawl Movie Heaven resources. I hope it helps you. If you have any questions, please leave a message and the editor will reply in a timely manner. Thank you for your support of the PHP Chinese network!

