Related question: How to extract the main text of a web page
I recently wrote a crawler, and matching the article content is a bit of a hassle: I have to write a separate regular expression for each site.
1. How can I crawl web article content intelligently? What do I need to learn?
e.g.
http://www.cnbeta.com/articles/385387.htm
http://www.ifanr.com/512005
2. After an article is fetched, how do I extract tags from it? They will be used later to recommend similar articles below each article.
Reply content:
The first question duplicates a question that has been asked before: how to identify and extract the main text of a web page?
For the second question, I wrote a simple word-segmentation algorithm and then extracted the words with the highest frequency as keywords. Even such a simple algorithm works reasonably well on the majority of web pages.
There are many word-segmentation algorithms available now; you can search for them.
There is also plenty of material on keyword extraction that you can look up; a minimal sketch of the frequency-based idea follows.
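A minimal Python sketch of this frequency-based approach, under the assumption of English text and a hand-picked stopword list; for Chinese text you would swap the whitespace tokenization for a real segmenter such as jieba:

```python
from collections import Counter
import re

# Tiny illustrative stopword list; a real one would be much larger.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "for"}

def keywords(text, k=5):
    # Split into lowercase word tokens, drop stopwords and very short words.
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in STOPWORDS and len(w) > 2]
    # The k most frequent remaining words serve as the keywords.
    return [w for w, _ in Counter(words).most_common(k)]

print(keywords("The crawler fetches the article, and the crawler extracts keywords from the article."))
```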
The second question also seems to be a duplicate of an existing question.
I previously wrote a content-collection plugin in PHP. This task is called page body extraction.
The algorithm goes roughly as follows:
1. Decompose the web page into many DOM blocks.
2. Discard or exclude the decomposed DOM blocks according to certain criteria. For example, a DOM block that contains mostly links is generally a list and can be discarded. You can also compute the text density (the text/HTML ratio), the proportion of certain tags (span, p, a, font), and so on. After several rounds of filtering, only a few DOM blocks remain; apply a final set of rules to pick among them, and the accuracy gets fairly high (a rough sketch follows after this list).
One more important signal can be used as a reference. I saw it in a paper: judge a block by how many periods its text contains. If a large block of text contains many periods, that DOM block is very likely the article content.
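A rough Python sketch of this filtering approach, assuming lxml is installed; the candidate tags, length threshold, and scoring formula are illustrative guesses, not the plugin's actual rules:

```python
from lxml import html

def extract_body(page_source):
    # Score every candidate block and keep the best-scoring one.
    doc = html.fromstring(page_source)
    best, best_score = None, 0.0
    for node in doc.xpath("//div | //td | //article"):
        raw = html.tostring(node, encoding="unicode")
        text = node.text_content()
        if len(text) < 100:            # too little text to be an article body
            continue
        density = len(text) / len(raw)                    # text/HTML ratio
        link_text = sum(len(a.text_content()) for a in node.findall(".//a"))
        link_ratio = link_text / len(text)                # mostly links => probably a list
        periods = text.count(".") + text.count("。")       # the period signal from the paper
        score = density * (1.0 - link_ratio) * (1 + periods)
        if score > best_score:
            best, best_score = node, score
    return best.text_content().strip() if best else ""
```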
I have written a Java crawler (Gworm), so let me offer a humble opinion. Given an arbitrary URL, intelligently extracting the article portion of the page is still very hard (not impossible: you can use statistical methods, but they can never be absolutely correct). So my approach has been to extract content with CSS selectors rather than hand-written regular expressions. A site's CSS class names are generally very stable, so all the articles on one site need only a single extraction rule. Your second question, getting the article tags, is also easily solved with a CSS selector. I don't know which Python libraries provide CSS selection over the DOM, but I am sure some exist; the Java CSS-selector library I use is jsoup.
Update: a quick Google search for "Python CSS selector" turns up plenty of results; see https://pythonhosted.org/cssselect/.
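As an illustration, a minimal per-site rule table in Python using lxml with cssselect (pip install lxml cssselect requests); the selector strings are hypothetical and have to be found once by inspecting each site's markup:

```python
import requests
from lxml import html

# One CSS rule per site; these selectors are hypothetical placeholders.
RULES = {
    "www.cnbeta.com": {"title": "h1", "body": ".article-content"},
    "www.ifanr.com":  {"title": "h1", "body": ".article-body"},
}

def extract(url):
    host = url.split("/")[2]
    rule = RULES[host]
    doc = html.fromstring(requests.get(url, timeout=10).text)
    title = doc.cssselect(rule["title"])[0].text_content().strip()
    body = doc.cssselect(rule["body"])[0].text_content().strip()
    return title, body
```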
Python has pyquery and PHP has phpQuery; both let you handle this with jQuery-style syntax, which is easy.
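A tiny pyquery sketch (pip install pyquery) under the same caveat: the ".article-content" selector is a hypothetical placeholder for the site's real class name.

```python
from pyquery import PyQuery as pq

doc = pq(url="http://www.ifanr.com/512005")  # pyquery fetches the page itself
title = doc("h1").text()
body = doc(".article-content").text()        # hypothetical selector
print(title)
```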
Python has the Scrapy framework, which is very good, and there is also the Scrapinghub cloud platform, which can save you a lot of work.
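A minimal Scrapy spider sketch; the CSS selectors are hypothetical placeholders to be adjusted per site. It can be run with `scrapy runspider spider.py -o articles.json`.

```python
import scrapy

class ArticleSpider(scrapy.Spider):
    name = "articles"
    start_urls = ["http://www.ifanr.com/512005"]

    def parse(self, response):
        # Yield one item per article page; selectors are hypothetical.
        yield {
            "title": response.css("h1::text").get(),
            "body": " ".join(response.css(".article-content ::text").getall()),
        }
```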
As for extracting tags, that involves classification and clustering algorithms, and there are many options in this area.
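One common option, as an example of my own choosing rather than anything the answer specifies: take each article's top TF-IDF terms as its tags, and cluster the TF-IDF vectors to find similar articles. Sketched with scikit-learn on placeholder data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Placeholder corpus; in practice these would be crawled article bodies.
articles = [
    "python crawler extracts article text from web pages",
    "scrapy is a python framework for writing web crawlers",
    "keyword extraction and clustering group similar articles",
]

vec = TfidfVectorizer()
X = vec.fit_transform(articles)
terms = vec.get_feature_names_out()

# The top TF-IDF terms of each article can serve as its tags.
for row in X.toarray():
    top = row.argsort()[-3:][::-1]
    print([terms[i] for i in top])

# Cluster labels give candidate groups of "similar articles".
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(labels)
```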
If the amount of data is small, use the Readability API to save trouble.
https://www.readability.com/developers/api/parser
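A minimal sketch of calling the Parser API with requests; the endpoint path, query parameters, and response fields below are my assumptions based on the documentation linked above, and the service requires a registered API token:

```python
import requests

# ASSUMPTION: endpoint, parameters, and response fields per the docs above.
resp = requests.get(
    "https://www.readability.com/api/content/v1/parser",
    params={"url": "http://www.cnbeta.com/articles/385387.htm",
            "token": "YOUR_API_TOKEN"},
    timeout=10,
)
data = resp.json()
print(data["title"])    # parsed title (assumed field name)
print(data["content"])  # cleaned article HTML (assumed field name)
```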
It is recommended not to parse HTML with regular expressions. Learn a little lxml: in Chrome's developer tools you can copy the XPath of the corresponding DOM node and use it directly in lxml, which saves a lot of work. And lxml's HTML/XML parsing performance is excellent.
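A minimal lxml sketch of that workflow; the XPath string is a hypothetical example of what Chrome's "Copy XPath" might produce for the target node:

```python
import requests
from lxml import html

page = html.fromstring(
    requests.get("http://www.cnbeta.com/articles/385387.htm", timeout=10).content)

# Paste the XPath copied from Chrome devtools (hypothetical example here):
nodes = page.xpath('//*[@id="news_content"]')
if nodes:
    print(nodes[0].text_content().strip())
```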