Related question: How to extract the main text of a web page
I recently wrote a crawler, and matching the article content is a bit of a hassle: I have to write a separate regular expression for each site.
1. How can I crawl web article content intelligently? What do I need to learn?
e.g.
http://www.cnbeta.com/articles/385387.htm
http://www.ifanr.com/512005
2. After an article is fetched, how do I extract tags from it? They will be used later to recommend similar articles below each article.
Reply content:
The first question duplicates a question that has been asked before: how to identify and extract the main text of a web page?
For the second question, I wrote a simple word-segmentation algorithm and then extracted the words with the highest frequency as keywords. Even such a simple algorithm works reasonably well on the majority of web pages.
There are many word-segmentation algorithms available now; you can search for them.
There is also plenty of material on keyword extraction that you can look up; a minimal sketch of the frequency-based idea follows.
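A minimal Python sketch of this frequency-based approach, under the assumption of English text and a hand-picked stopword list; for Chinese text you would swap the whitespace tokenization for a real segmenter such as jieba:

```python
from collections import Counter
import re

# Tiny illustrative stopword list; a real one would be much larger.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "for"}

def keywords(text, k=5):
    # Split into lowercase word tokens, drop stopwords and very short words.
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOPWORDS and len(w) > 2]
    # The k most frequent remaining words serve as the keywords.
    return [w for w, _ in Counter(words).most_common(k)]

print(keywords("The crawler fetches the article, and the crawler extracts keywords from the article."))
```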
The second question also seems to be a duplicate of an existing question.
I previously wrote a content-collection plugin in PHP. This task is called page body extraction.
The algorithm goes roughly as follows:
1. Decompose the web page into many DOM blocks.
2. Discard or exclude the decomposed DOM blocks according to certain criteria. For example, a DOM block that contains mostly links is generally a list and can be discarded. You can also compute the text density (the text/HTML ratio), the proportion of certain tags (span, p, a, font), and so on. After several rounds of filtering, only a few DOM blocks remain; apply a final set of rules to pick among them, and the accuracy gets fairly high (a rough sketch follows after this list).
One more important signal can be used as a reference. I saw it in a paper: judge a block by how many periods its text contains. If a large block of text contains many periods, that DOM block is very likely the article content.
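A rough Python sketch of this filtering approach, assuming lxml is installed; the candidate tags, length threshold, and scoring formula are illustrative guesses, not the plugin's actual rules:

```python
from lxml import html

def extract_body(page_source):
    # Score every candidate block and keep the best-scoring one.
    doc = html.fromstring(page_source)
    best, best_score = None, 0.0
    for node in doc.xpath("//div | //td | //article"):
        raw = html.tostring(node, encoding="unicode")
        text = node.text_content()
        if len(text) < 100:            # too little text to be an article body
            continue
        density = len(text) / len(raw)                    # text/HTML ratio
        link_text = sum(len(a.text_content()) for a in node.findall(".//a"))
        link_ratio = link_text / len(text)                # mostly links => probably a list
        periods = text.count(".") + text.count("。")       # the period signal from the paper
        score = density * (1.0 - link_ratio) * (1 + periods)
        if score > best_score:
            best, best_score = node, score
    return best.text_content().strip() if best else ""
```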
I have written a Java crawler (Gworm), so let me offer a humble opinion. Given an arbitrary URL, intelligently extracting the article portion of the page is still very hard (not impossible: you can use statistical methods, but they can never be absolutely correct). So my approach has been to extract content with CSS selectors rather than hand-written regular expressions. A site's CSS class names are generally very stable, so all the articles on one site need only a single extraction rule. Your second question, getting the article tags, is also easily solved with a CSS selector. I don't know which Python libraries provide CSS selection over the DOM, but I am sure some exist; the Java CSS-selector library I use is jsoup.
Update: a quick Google search for "Python CSS selector" turns up plenty of results; see https://pythonhosted.org/cssselect/.
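As an illustration, a minimal per-site rule table in Python using lxml with cssselect (pip install lxml cssselect requests); the selector strings are hypothetical and have to be found once by inspecting each site's markup:

```python
import requests
from lxml import html

# One CSS rule per site; these selectors are hypothetical placeholders.
RULES = {
    "www.cnbeta.com": {"title": "h1", "body": ".article-content"},
    "www.ifanr.com":  {"title": "h1", "body": ".article-body"},
}

def extract(url):
    host = url.split("/")[2]
    rule = RULES[host]
    doc = html.fromstring(requests.get(url, timeout=10).text)
    title = doc.cssselect(rule["title"])[0].text_content().strip()
    body = doc.cssselect(rule["body"])[0].text_content().strip()
    return title, body
```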
Python has pyquery and PHP has phpQuery; both let you handle this with jQuery-style syntax, which is easy.
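A tiny pyquery sketch (pip install pyquery) under the same caveat: the ".article-content" selector is a hypothetical placeholder for the site's real class name.

```python
from pyquery import PyQuery as pq

doc = pq(url="http://www.ifanr.com/512005")  # pyquery fetches the page itself
title = doc("h1").text()
body = doc(".article-content").text()        # hypothetical selector
print(title)
```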
Python has the Scrapy framework, which is very good, and there is also the Scrapinghub cloud platform, which can save you a lot of work.
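A minimal Scrapy spider sketch; the CSS selectors are hypothetical placeholders to be adjusted per site. It can be run with `scrapy runspider spider.py -o articles.json`.

```python
import scrapy

class ArticleSpider(scrapy.Spider):
    name = "articles"
    start_urls = ["http://www.ifanr.com/512005"]

    def parse(self, response):
        # Yield one item per article page; selectors are hypothetical.
        yield {
            "title": response.css("h1::text").get(),
            "body": " ".join(response.css(".article-content ::text").getall()),
        }
```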
As for extracting tags, that involves classification and clustering algorithms, and there are many options in this area.
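One common option, as an example of my own choosing rather than anything the answer specifies: take each article's top TF-IDF terms as its tags, and cluster the TF-IDF vectors to find similar articles. Sketched with scikit-learn on placeholder data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Placeholder corpus; in practice these would be crawled article bodies.
articles = [
    "python crawler extracts article text from web pages",
    "scrapy is a python framework for writing web crawlers",
    "keyword extraction and clustering group similar articles",
]

vec = TfidfVectorizer()
X = vec.fit_transform(articles)
terms = vec.get_feature_names_out()

# The top TF-IDF terms of each article can serve as its tags.
for row in X.toarray():
    top = row.argsort()[-3:][::-1]
    print([terms[i] for i in top])

# Cluster labels give candidate groups of "similar articles".
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(labels)
```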
If the amount of data is small, use the Readability API to save trouble.
https://www.readability.com/developers/api/parser
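A minimal sketch of calling the Parser API with requests; the endpoint path, query parameters, and response fields below are my assumptions based on the documentation linked above, and the service requires a registered API token:

```python
import requests

# ASSUMPTION: endpoint, parameters, and response fields per the docs above.
resp = requests.get(
    "https://www.readability.com/api/content/v1/parser",
    params={"url": "http://www.cnbeta.com/articles/385387.htm",
            "token": "YOUR_API_TOKEN"},
    timeout=10,
)
data = resp.json()
print(data["title"])    # parsed title (assumed field name)
print(data["content"])  # cleaned article HTML (assumed field name)
```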
It is recommended not to parse HTML with regular expressions. Learn a little lxml: in Chrome's developer tools you can copy the XPath of the corresponding DOM node and use it directly in lxml, which saves a lot of work. And lxml's HTML/XML parsing performance is excellent.
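A minimal lxml sketch of that workflow; the XPath string is a hypothetical example of what Chrome's "Copy XPath" might produce for the target node:

```python
import requests
from lxml import html

page = html.fromstring(
    requests.get("http://www.cnbeta.com/articles/385387.htm", timeout=10).content)

# Paste the XPath copied from Chrome devtools (hypothetical example here):
nodes = page.xpath('//*[@id="news_content"]')
if nodes:
    print(nodes[0].text_content().strip())
```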