How can crawlers intelligently crawl the content of web pages?

Source: Internet
Author: User
Related question: how to extract the main text of a web page

Recently I wrote a crawler that uses regular expressions to match the body of an article. This is a little troublesome, because every website needs its own regular expressions.
1. How can I crawl web articles intelligently? What should I do?
E.g.:
http://www.cnbeta.com/articles/385387.htm
http://www.ifanr.com/512005
2. After an article is captured, how can I extract its tags? They will be used later to recommend similar articles.

Reply content:


The first question is the important, long-standing one: how do you identify and extract the main text of a web page?

For the second question, I used a word segmentation algorithm and took the words that appear most frequently; even a very simple algorithm makes little difference on most pages.

There are plenty of word segmentation algorithms you can search for, and plenty of material on extracting useful keywords as well; just look them up.
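For illustration, here is a minimal Python sketch of that idea, assuming the jieba package for word segmentation (the reply does not name a specific library; any segmenter would do):

    # Count word frequencies in the extracted body text and keep the top words as tags.
    # Assumes the `jieba` segmentation library; the idea works with any segmenter.
    from collections import Counter
    import jieba

    def top_words(text, n=10, min_len=2):
        words = [w.strip() for w in jieba.cut(text)]           # segment the text into words
        words = [w for w in words if len(w) >= min_len]        # drop single characters and most punctuation
        return [w for w, _ in Counter(words).most_common(n)]   # highest-frequency words as candidate tags

    # print(top_words(article_text))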

The second question seems to be the more important one.

I once wrote a collection plug-in in PHP that does what you call webpage body extraction.

The algorithm is roughly as follows:

1. Break the web page down into many DOM blocks.
2. Filter the decomposed DOM blocks against certain criteria and discard the ones that fail. For example, a block that contains a pile of links is usually a list and can be thrown away. Text density (the ratio of plain text to raw HTML) is also computed, as is the proportion of tags such as span, p, a, font and so on. After several rounds of filtering only a few DOM blocks remain, and a final pass of rules picks one out. The accuracy is fairly high.

One more value is very useful as a reference. I saw in a paper that the number of periods in a piece of text can be used as a signal: if a large block of text contains many periods, that DOM block is very likely the article body.
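Here is a minimal Python sketch of those heuristics (the original plug-in is in PHP; the candidate tags, thresholds, and weighting below are invented for illustration and would need tuning):

    # Score each candidate DOM block by text density (text / html), penalise link-heavy
    # blocks (usually lists), and reward blocks with many periods. Pick the best-scoring block.
    import lxml.html

    def best_block(html):
        tree = lxml.html.fromstring(html)
        best, best_score = None, 0.0
        for block in tree.iter():
            if block.tag not in ('div', 'article', 'td'):          # arbitrary candidate tags
                continue
            raw = lxml.html.tostring(block, encoding='unicode')
            text = block.text_content()
            if len(text) < 100:                                     # skip tiny blocks
                continue
            density = len(text) / len(raw)                          # text / html ratio
            link_text = sum(len(a.text_content()) for a in block.iter('a'))
            link_ratio = link_text / len(text)                      # link-heavy blocks are usually lists
            periods = text.count('.') + text.count('。')            # many periods -> likely body text
            score = density * (1.0 - link_ratio) * (1.0 + min(periods, 20) / 20.0)
            if score > best_score:
                best, best_score = block, score
        return best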

I have previously written a Java crawler (Gworm), so let me give you my take: intelligently extracting the article portion of a web page is still very difficult (there is no perfect way to do it; statistical approaches can never be fully correct). My solution was therefore to use CSS selectors to extract the content instead of hand-written regular expressions. The CSS class names of a website are generally stable, so only one extraction rule is needed for all the articles on a site. Likewise, CSS selectors make it easy to grab an article's tags, which takes care of the second question. Since the asker is crawling with Python, I don't know offhand which Python library provides CSS selection over the DOM, but I am sure one exists; for Java the CSS selector library is Jsoup.

Update: just googling "python css selector" turns up the answer; see https://pythonhosted.org/cssselect/ for more information.
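For example, a small sketch with lxml plus cssselect, assuming you maintain one hand-written rule per site (the selector strings below are made up, not the sites' real class names):

    # One CSS rule per site instead of one regular expression per article.
    # Requires lxml and the cssselect package; look up each site's real class names.
    import lxml.html

    RULES = {
        'www.cnbeta.com': {'body': 'div.article-content', 'tags': 'div.article-tags a'},
        'www.ifanr.com':  {'body': 'article.post-content', 'tags': 'a[rel=tag]'},
    }

    def extract(host, html):
        rule = RULES[host]
        doc = lxml.html.fromstring(html)
        body = doc.cssselect(rule['body'])[0].text_content()             # question 1: article body
        tags = [a.text_content() for a in doc.cssselect(rule['tags'])]   # question 2: tags
        return body, tags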

Python has pyquery and PHP has phpQuery; both let you work on the DOM with jQuery-style syntax, which is very convenient.
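A quick pyquery sketch of the same idea (the selectors are placeholders, not real class names on those sites):

    # pyquery can fetch the page itself and then query it with jQuery-style selectors.
    from pyquery import PyQuery as pq

    doc = pq(url='http://www.cnbeta.com/articles/385387.htm')
    body = doc('div.article-content').text()                    # article body (placeholder selector)
    tags = [pq(a).text() for a in doc('div.article-tags a')]    # tag links, if the site has them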

Python has the Scrapy framework, which is good. There is also the Scrapinghub cloud platform, which can save you a lot of work.
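A minimal Scrapy spider sketch, assuming a recent Scrapy version with the .get()/.getall() selector API (the CSS selectors are placeholders):

    # One parse rule per site, expressed as CSS selectors inside a Scrapy spider.
    import scrapy

    class ArticleSpider(scrapy.Spider):
        name = 'articles'
        start_urls = ['http://www.cnbeta.com/articles/385387.htm']

        def parse(self, response):
            yield {
                'title': response.css('title::text').get(),
                'body': ' '.join(response.css('div.article-content ::text').getall()),
            }

    # Run with:  scrapy runspider article_spider.py -o articles.json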

As for capturing tags, that involves classification and clustering algorithms, and there are many to choose from.
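As one possible illustration of the clustering route, here is a sketch with scikit-learn's TF-IDF vectorizer and k-means (the parameters are arbitrary, not a tuned pipeline):

    # Group crawled articles by content: TF-IDF bag-of-words vectors + k-means clusters.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    def cluster_articles(texts, k=5):
        vec = TfidfVectorizer(max_features=5000)    # one TF-IDF feature vector per article
        X = vec.fit_transform(texts)
        km = KMeans(n_clusters=k, n_init=10)
        return km.fit_predict(X)                    # cluster id per article

    # Articles that share a cluster id are candidates for "similar article" recommendations.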

If the data volume is small, use the Readability API to save yourself the trouble.

https://www.readability.com/developers/api/parser
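If you would rather run the same idea locally instead of calling the hosted API, the readability-lxml package (a Python port of the Readability extraction heuristic, not the API above) can be used roughly like this:

    # Local alternative to the hosted parser: the readability-lxml package.
    import requests
    from readability import Document

    html = requests.get('http://www.ifanr.com/512005').text
    doc = Document(html)
    print(doc.title())      # article title
    print(doc.summary())    # cleaned-up HTML of the main content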

I recommend not using regular expressions to parse HTML. Learn lxml instead. In the Chrome browser's developer tools you can copy the XPath of the corresponding DOM node and use it directly with lxml, which saves a lot of trouble; in addition, lxml parses HTML and XML with good performance.
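A small sketch of that workflow (the XPath string below is an invented example; paste the one you copy from Chrome's "Copy XPath" for your target node):

    # Fetch a page, parse it with lxml, and select the article node by a copied XPath.
    import requests
    import lxml.html

    html = requests.get('http://www.cnbeta.com/articles/385387.htm').text
    tree = lxml.html.fromstring(html)
    nodes = tree.xpath('//*[@id="artibody"]')      # paste the XPath copied from devtools here
    if nodes:
        print(nodes[0].text_content())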
