Problems encountered while writing the crawler

Continuing from the previous article, in which the URL manager and the downloader were written, the next piece is the page parser. Of the several modules, this one is generally the hardest. After the downloader fetches a page we have its source, but that is not yet the result we want, and there is so much markup that the data we are after is hard to find. Fortunately, what we download is an HTML page: a text document with a tree structure built from layers of nested nodes. Compared with a plain TXT file, this makes it much easier to locate the block of data we are looking for. All we have to do now is go back to the original page and analyze where the data we want lives.

Open the Baidu Encyclopedia Python entry page and press F12 to bring up the developer tools. With these tools we can navigate to the content of the page:

This lets us find the tags that hold the information we want.

import re
from urllib.parse import urljoin

import bs4


class HtmlParser(object):
    """docstring for HtmlParser"""

    def _get_new_urls(self, url, soup):
        new_urls = set()
        links = soup.find_all('a', href=re.compile(r'/item/.'))
        for link in links:
            new_url = re.sub(r'(/item/)(.*)', r'\1%s' % link.get_text(), link['href'])
            new_full_url = urljoin(url, new_url)
            new_urls.add(new_full_url)
        return new_urls

    def _get_new_data(self, url, soup):
        res_data = {}
        # url
        res_data['url'] = url
        # <dd class="lemmaWgt-lemmaTitle-title"><h1>...</h1></dd>
        title_node = soup.find('dd', class_="lemmaWgt-lemmaTitle-title").find('h1')
        res_data['title'] = title_node.get_text()
        # <div class="lemma-summary" label-module="lemmaSummary">
        summary_node = soup.find('div', class_="lemma-summary")
        res_data['summary'] = summary_node.get_text()
        return res_data

    def parse(self, url, html_cont):
        if url is None or html_cont is None:
            return
        soup = bs4.BeautifulSoup(html_cont, 'lxml')
        new_urls = self._get_new_urls(url, soup)
        new_data = self._get_new_data(url, soup)
        return new_urls, new_data

The parser exposes only one external method, parse.

A. First it accepts two parameters, url and html_cont, and checks whether the page content is empty.

B. It then calls BeautifulSoup from the bs4 module to parse the page content, passing 'lxml' as the document parser. The default is html.parser, but the official BeautifulSoup documentation recommends lxml, so I follow the official advice.

C. Next it calls the two internal methods to obtain the list of new URLs and the data.

D. Finally it returns the URL list and the data.
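
Putting the four steps together, a minimal way to exercise the parser might look like the following. This is only a sketch under my own assumptions: it fetches the page with urllib directly instead of the downloader from the previous article, and it assumes the HtmlParser class above is in scope.

from urllib.request import Request, urlopen

root_url = 'https://baike.baidu.com/item/Python'
# A User-Agent header makes the request look like a normal browser visit.
req = Request(root_url, headers={'User-Agent': 'Mozilla/5.0'})
html_cont = urlopen(req).read()      # raw page bytes, as the downloader would return them

parser = HtmlParser()
new_urls, new_data = parser.parse(root_url, html_cont)

print('found %d candidate URLs' % len(new_urls))
print(new_data['title'])
print(new_data['summary'][:60])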

Here are a few points worth noting.

1. The BeautifulSoup call also takes a from_encoding parameter. It duplicates the encoding handling I already do in the downloader, so I dropped it here; the two approaches have the same effect (see the sketch after this list).

2. The internal method that extracts the URL list needs a regular expression. I am still feeling my way with regexes and am not very fluent, so this part took several rounds of debugging (it is also shown in the sketch below).

3. The data is placed in a dictionary, so individual fields can be read, changed, or deleted by key.
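
To make notes 1 and 2 concrete, here is a small standalone sketch. It is my own illustration rather than code from this series, and the sample link is made up:

import re
import bs4
from urllib.parse import urljoin

# Note 1: if the downloader hands over raw bytes, BeautifulSoup can be told the encoding
# via from_encoding; if the bytes are decoded to str beforehand, the parameter is unnecessary.
html_cont = '<a href="/item/%E8%AE%A1%E7%AE%97%E6%9C%BA">计算机</a>'.encode('utf-8')
soup = bs4.BeautifulSoup(html_cont, 'lxml', from_encoding='utf-8')

# Note 2: the pattern matches any href that starts with /item/, and re.sub swaps the
# percent-encoded tail for the readable link text before the URL is made absolute.
page_url = 'https://baike.baidu.com/item/Python'
link = soup.find('a', href=re.compile(r'/item/.'))
new_url = re.sub(r'(/item/)(.*)', r'\1%s' % link.get_text(), link['href'])
print(urljoin(page_url, new_url))    # https://baike.baidu.com/item/计算机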

Finally, the data outputer. This one is relatively simple, so here is the code directly.

  

class HtmlOutputer(object):
    """docstring for HtmlOutputer"""

    def __init__(self):
        self.datas = []

    def collect_data(self, new_data):
        if new_data is None:
            return
        self.datas.append(new_data)

    def output_html(self):
        fout = open('output1.html', 'w', encoding='utf-8')
        fout.write('<html>')
        fout.write('<head><meta charset="utf-8"></head>')
        fout.write('<body><table>')
        for data in self.datas:
            fout.write('<tr>')
            fout.write('<td>%s</td>' % data['url'])
            fout.write('<td>%s</td>' % data['title'])
            fout.write('<td>%s</td>' % data['summary'])
            fout.write('</tr>')
        fout.write('</table></body></html>')
        fout.close()

There are also two points worth noting here.

1. fout = open('output1.html', 'w', encoding='utf-8'): the encoding parameter must be included here, otherwise the write raises an error, because on Windows files are written with GBK encoding by default.

2. The fout.write() calls emit the HTML markup one tag at a time; note the <meta charset="utf-8"> tag, which tells the browser to read the UTF-8 file correctly so the Chinese content is not garbled.
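
A quick usage sketch of the outputer (my own example, reusing the new_data dictionary produced by the parser earlier):

outputer = HtmlOutputer()

# new_data is the dictionary returned by HtmlParser.parse() for one page;
# collect_data() would normally be called once per crawled page.
outputer.collect_data(new_data)

# After the crawl loop finishes, write everything collected to output1.html.
outputer.output_html()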

Looking back, there are many aspects of this program that could be explored in more depth:

1. The amount of data crawled here is too small; I only tried to crawl about 10,000 pages. Once the volume skyrockets, problems appear: the URLs waiting to be crawled and the URLs already crawled can no longer be kept in in-memory set collections, and should instead live in a Redis cache server or a MySQL database (see the sketch after this list).

2. Likewise for the data itself: a dictionary is no longer enough, and a dedicated database is needed for storage.

3. Once the volume goes up, crawl efficiency starts to matter, so multi-threading needs to be added.

4. Once the task grows, the pressure on a single server becomes too great, and a single outage carries a lot of risk, so a distributed, highly available architecture has to follow.

5. The pages crawled here are too simple: they are all static pages, with no login and no AJAX-based dynamic loading.

6. This is only data collection; modeling and analysis come afterwards...
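
For the first point, here is a minimal sketch of what a Redis-backed URL manager could look like. This is my own illustration rather than code from this series, and it assumes the third-party redis package and a Redis server on localhost:

import redis


class RedisUrlManager(object):
    """Keeps the to-crawl and crawled URL sets in Redis instead of in memory."""

    def __init__(self, host='localhost', port=6379):
        self.r = redis.Redis(host=host, port=port, decode_responses=True)

    def add_new_url(self, url):
        # Only queue URLs that have not been crawled yet.
        if url and not self.r.sismember('crawled_urls', url):
            self.r.sadd('new_urls', url)

    def has_new_url(self):
        return self.r.scard('new_urls') > 0

    def get_new_url(self):
        # spop atomically removes and returns one pending URL.
        url = self.r.spop('new_urls')
        self.r.sadd('crawled_urls', url)
        return url

The same two sets could just as well live in a MySQL table; Redis is shown only because its set commands map directly onto the membership checks the URL manager needs.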

To sum up, there is still a long road ahead. Keep going!

  
