Problems encountered while writing the crawler

Continuing from the previous article, in which the URL manager and the downloader were written, the next piece is the page parser. Of the several modules, this one is generally the hardest. After the downloader fetches a page we have its source, but that is not yet the result we want, and there is so much markup that the data we are after is hard to find. Fortunately, what we download is an HTML page: a text document with a tree structure built from layers of nested nodes. Compared with a plain TXT file, this makes it much easier to locate the block of data we are looking for. All we have to do now is go back to the original page and analyze where the data we want lives.

Open the Baidu Encyclopedia Python entry page and press F12 to bring up the developer tools. With these tools we can navigate to the content of the page:

This lets us find the tags that hold the information we want.

import re
from urllib.parse import urljoin

import bs4


class HtmlParser(object):
    """docstring for HtmlParser"""

    def _get_new_urls(self, url, soup):
        new_urls = set()
        links = soup.find_all('a', href=re.compile(r'/item/.'))
        for link in links:
            new_url = re.sub(r'(/item/)(.*)', r'\1%s' % link.get_text(), link['href'])
            new_full_url = urljoin(url, new_url)
            new_urls.add(new_full_url)
        return new_urls

    def _get_new_data(self, url, soup):
        res_data = {}
        # url
        res_data['url'] = url
        # <dd class="lemmaWgt-lemmaTitle-title"><h1>...</h1></dd>
        title_node = soup.find('dd', class_="lemmaWgt-lemmaTitle-title").find('h1')
        res_data['title'] = title_node.get_text()
        # <div class="lemma-summary" label-module="lemmaSummary">
        summary_node = soup.find('div', class_="lemma-summary")
        res_data['summary'] = summary_node.get_text()
        return res_data

    def parse(self, url, html_cont):
        if url is None or html_cont is None:
            return
        soup = bs4.BeautifulSoup(html_cont, 'lxml')
        new_urls = self._get_new_urls(url, soup)
        new_data = self._get_new_data(url, soup)
        return new_urls, new_data

The parser exposes only one external method, parse.

A. First it accepts two parameters, url and html_cont, and checks whether the page content is empty.

B. It then calls BeautifulSoup from the bs4 module to parse the page content, passing 'lxml' as the document parser. The default is html.parser, but the official BeautifulSoup documentation recommends lxml, so I follow the official advice.

C. Next it calls the two internal methods to obtain the list of new URLs and the data.

D. Finally it returns the URL list and the data.
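
Putting the four steps together, a minimal way to exercise the parser might look like the following. This is only a sketch under my own assumptions: it fetches the page with urllib directly instead of the downloader from the previous article, and it assumes the HtmlParser class above is in scope.

from urllib.request import Request, urlopen

root_url = 'https://baike.baidu.com/item/Python'
# A User-Agent header makes the request look like a normal browser visit.
req = Request(root_url, headers={'User-Agent': 'Mozilla/5.0'})
html_cont = urlopen(req).read()      # raw page bytes, as the downloader would return them

parser = HtmlParser()
new_urls, new_data = parser.parse(root_url, html_cont)

print('found %d candidate URLs' % len(new_urls))
print(new_data['title'])
print(new_data['summary'][:60])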

Here are a few points worth noting.

1. The BeautifulSoup call also takes a from_encoding parameter. It duplicates the encoding handling I already do in the downloader, so I dropped it here; the two approaches have the same effect (see the sketch after this list).

2. The internal method that extracts the URL list needs a regular expression. I am still feeling my way with regexes and am not very fluent, so this part took several rounds of debugging (it is also shown in the sketch below).

3. The data is placed in a dictionary, so individual fields can be read, changed, or deleted by key.
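
To make notes 1 and 2 concrete, here is a small standalone sketch. It is my own illustration rather than code from this series, and the sample link is made up:

import re
import bs4
from urllib.parse import urljoin

# Note 1: if the downloader hands over raw bytes, BeautifulSoup can be told the encoding
# via from_encoding; if the bytes are decoded to str beforehand, the parameter is unnecessary.
html_cont = '<a href="/item/%E8%AE%A1%E7%AE%97%E6%9C%BA">计算机</a>'.encode('utf-8')
soup = bs4.BeautifulSoup(html_cont, 'lxml', from_encoding='utf-8')

# Note 2: the pattern matches any href that starts with /item/, and re.sub swaps the
# percent-encoded tail for the readable link text before the URL is made absolute.
page_url = 'https://baike.baidu.com/item/Python'
link = soup.find('a', href=re.compile(r'/item/.'))
new_url = re.sub(r'(/item/)(.*)', r'\1%s' % link.get_text(), link['href'])
print(urljoin(page_url, new_url))    # https://baike.baidu.com/item/计算机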

Finally, the data outputer. This one is relatively simple, so here is the code directly.

  

class HtmlOutputer(object):
    """docstring for HtmlOutputer"""

    def __init__(self):
        self.datas = []

    def collect_data(self, new_data):
        if new_data is None:
            return
        self.datas.append(new_data)

    def output_html(self):
        fout = open('output1.html', 'w', encoding='utf-8')
        fout.write('<html>')
        fout.write('<head><meta charset="utf-8"></head>')
        fout.write('<body><table>')
        for data in self.datas:
            fout.write('<tr>')
            fout.write('<td>%s</td>' % data['url'])
            fout.write('<td>%s</td>' % data['title'])
            fout.write('<td>%s</td>' % data['summary'])
            fout.write('</tr>')
        fout.write('</table></body></html>')
        fout.close()

There are also two points worth noting here.

1. fout = open('output1.html', 'w', encoding='utf-8'): the encoding parameter must be included here, otherwise the write raises an error, because on Windows files are written with GBK encoding by default.

2. The fout.write() calls emit the HTML markup one tag at a time; note the <meta charset="utf-8"> tag, which tells the browser to read the UTF-8 file correctly so the Chinese content is not garbled.
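
A quick usage sketch of the outputer (my own example, reusing the new_data dictionary produced by the parser earlier):

outputer = HtmlOutputer()

# new_data is the dictionary returned by HtmlParser.parse() for one page;
# collect_data() would normally be called once per crawled page.
outputer.collect_data(new_data)

# After the crawl loop finishes, write everything collected to output1.html.
outputer.output_html()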

Looking back, there are many aspects of this program that could be explored in more depth:

1. The amount of data crawled here is too small; I only tried to crawl about 10,000 pages. Once the volume skyrockets, problems appear: the URLs waiting to be crawled and the URLs already crawled can no longer be kept in in-memory set collections, and should instead live in a Redis cache server or a MySQL database (see the sketch after this list).

2. Likewise for the data itself: a dictionary is no longer enough, and a dedicated database is needed for storage.

3. Once the volume goes up, crawl efficiency starts to matter, so multi-threading needs to be added.

4. Once the task grows, the pressure on a single server becomes too great, and a single outage carries a lot of risk, so a distributed, highly available architecture has to follow.

5. The pages crawled here are too simple: they are all static pages, with no login and no AJAX-based dynamic loading.

6. This is only data collection; modeling and analysis come afterwards...
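
For the first point, here is a minimal sketch of what a Redis-backed URL manager could look like. This is my own illustration rather than code from this series, and it assumes the third-party redis package and a Redis server on localhost:

import redis


class RedisUrlManager(object):
    """Keeps the to-crawl and crawled URL sets in Redis instead of in memory."""

    def __init__(self, host='localhost', port=6379):
        self.r = redis.Redis(host=host, port=port, decode_responses=True)

    def add_new_url(self, url):
        # Only queue URLs that have not been crawled yet.
        if url and not self.r.sismember('crawled_urls', url):
            self.r.sadd('new_urls', url)

    def has_new_url(self):
        return self.r.scard('new_urls') > 0

    def get_new_url(self):
        # spop atomically removes and returns one pending URL.
        url = self.r.spop('new_urls')
        self.r.sadd('crawled_urls', url)
        return url

The same two sets could just as well live in a MySQL table; Redis is shown only because its set commands map directly onto the membership checks the URL manager needs.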

To sum up, there is still a long road ahead. Keep going!

  
