Related question: how to extract the webpage text
Recently I wrote a crawler that uses regular expressions to match the body of an article. This is a bit troublesome, because a separate set of regular expressions has to be written for every website.
1. How can I crawl web articles intelligently? What approach should I take?
E.g.:
http://www.cnbeta.com/articles/385387.htm
http://www.ifanr.com/512005
2. How can I extract the tags of an article after crawling it? They will be used later to recommend similar articles.
Reply content:
The first question is the real, still-open problem: how to identify and extract the body text of a webpage.
For the second question, I used a word segmentation algorithm and took the words with the highest frequency as tags. Even a very simple algorithm does not make much difference on most pages' data.
In any case, you can find many word segmentation algorithms by searching;
there are also many methods for extracting the useful words, which you can likewise search for...
The second question actually seems more important than the first.
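For what it's worth, a minimal Python sketch of the "segment, then keep high-frequency words" idea might look like the following. It assumes Chinese articles and the jieba library (my own choice, not something the answerer named); jieba.analyse.extract_tags ranks words by TF-IDF weight.

```python
# Sketch: extract article tags via word segmentation + frequency/weight ranking.
# jieba is an assumed library choice, not named in the answer above.
import jieba.analyse

def extract_article_tags(text, top_k=10):
    # Returns the top_k highest-weighted words to use as article tags.
    return jieba.analyse.extract_tags(text, topK=top_k)

# Example:
# tags = extract_article_tags(article_body)
# print(tags)
```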
I once wrote a collection (scraping) plug-in in PHP that does what you call webpage body extraction.
The algorithm is roughly as follows:
1. Break the webpage down into many DOM blocks.
2. Discard and exclude the decomposed DOM blocks according to certain criteria. For example, a DOM block containing a pile of links is generally a list and can be discarded. The text density (text/html ratio) is also calculated, as is the proportion of tags such as span, p, a, font, and so on. After several rounds of filtering, only a few DOM blocks are left; these are then filtered again by certain rules. The accuracy is fairly high.
Punctuation can also be used as a reference. I saw in a paper that the number of periods in a block of text is used as a criterion:
if a large block of text contains many periods, that DOM block is very likely the body content.
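To make this concrete, here is a rough Python sketch of the same DOM-block filtering idea (the original is a PHP plug-in I have not seen; the thresholds and the use of lxml are my own illustrative assumptions, not the author's code):

```python
# Sketch: pick the "body" DOM block by link density, text density, and period count.
from lxml import html

def pick_body_block(page_html):
    tree = html.fromstring(page_html)
    best, best_score = None, 0.0
    for div in tree.iter("div"):
        text = div.text_content().strip()
        if len(text) < 200:               # too little text to be the article body
            continue
        links_text = sum(len(a.text_content()) for a in div.iter("a"))
        if links_text / max(len(text), 1) > 0.5:   # mostly links -> likely a list, discard
            continue
        # text density: visible text vs. raw markup length of this block
        density = len(text) / max(len(html.tostring(div)), 1)
        periods = text.count("。") + text.count(".")  # the "period count" signal
        score = density * (1 + periods)
        if score > best_score:
            best, best_score = div, score
    return best.text_content().strip() if best is not None else None
```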
I previously wrote a Java crawler (Gworm), so to give you an example from experience: intelligently extracting the article part of a web page is still very difficult (there is no foolproof way to do it, and statistical/probabilistic methods cannot be correct every time). So my solution was to use CSS selectors to extract the content instead of hand-writing regular expressions. The CSS class names of a website are generally stable, so only one extraction rule is needed for all articles on that site. CSS selectors also make it easy to grab the article tags, which solves your second problem. Since you are crawling with Python, I don't know offhand which Python library provides CSS selection over the DOM, but there must be one; for Java, the CSS selector library is Jsoup.
Update: just google "python css selector" and you will find one. For more information see: https://pythonhosted.org/cssselect/
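As a rough sketch of this per-site rule idea in Python, lxml can apply CSS selectors when the cssselect package from the link above is installed. The selectors below are placeholders; each site needs its own, but only one rule per site:

```python
# Sketch: extract article body and tags with CSS selectors (one rule per site).
import requests
from lxml import html

def extract_with_css(url, body_selector, tag_selector):
    tree = html.fromstring(requests.get(url).content)
    body = " ".join(el.text_content() for el in tree.cssselect(body_selector))
    tags = [el.text_content().strip() for el in tree.cssselect(tag_selector)]
    return body, tags

# Hypothetical selectors for one site:
# body, tags = extract_with_css("http://www.cnbeta.com/articles/385387.htm",
#                               ".article-content", ".article-tags a")
```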
Python has pyquery and PHP has phpquery; both let you process the page with jQuery-style syntax, which is very convenient.
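A short pyquery sketch of that jQuery-style usage (the ".article-content" class is a placeholder, not a real class name on these sites):

```python
# Sketch: jQuery-style extraction with pyquery.
from pyquery import PyQuery as pq

doc = pq(url="http://www.ifanr.com/512005")    # pyquery can fetch the URL itself
body = doc(".article-content").text()           # jQuery-style selection
links = [pq(a).attr("href") for a in doc(".article-content a")]
```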
Python has the Scrapy framework, which is quite good, and there is also the Scrapinghub cloud platform, which can save you a lot of work; a minimal spider sketch follows.
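Something like the following (the ".article-content" selector is a placeholder you would adjust per site; response.css with .get()/.getall() is the selector API in newer Scrapy versions):

```python
# Sketch: a minimal Scrapy spider that yields title and body text.
import scrapy

class ArticleSpider(scrapy.Spider):
    name = "articles"
    start_urls = ["http://www.cnbeta.com/articles/385387.htm"]

    def parse(self, response):
        yield {
            "title": response.css("title::text").get(),
            "body": " ".join(response.css(".article-content ::text").getall()),
        }
```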
As for capturing tags, that gets into classification and clustering algorithms, and there are many options to choose from.
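One way to act on the clustering suggestion, sketched with scikit-learn (an assumed library choice, not one named above): group articles by TF-IDF similarity, then treat articles in the same cluster as candidates for "similar article" recommendations.

```python
# Sketch: cluster crawled articles by TF-IDF similarity for recommendations.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_articles(texts, n_clusters=5):
    vectors = TfidfVectorizer(max_features=5000).fit_transform(texts)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vectors)
    return labels   # articles sharing a label are "similar article" candidates
```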
If the data volume is small, using the readability API saves trouble:
https://www.readability.com/developers/api/parser
I recommend not using regular expressions to parse HTML. Look into lxml instead. In Chrome's developer mode you can copy the XPath of the corresponding DOM node and use it directly with lxml, which saves a lot of trouble; lxml also parses HTML and XML with good performance.
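A small sketch of that lxml + "Copy XPath from Chrome dev tools" workflow (the XPath string below is a made-up example of what Chrome might give you, not the real node id on cnbeta):

```python
# Sketch: paste an XPath copied from Chrome's Elements panel into lxml.
import requests
from lxml import html

tree = html.fromstring(requests.get("http://www.cnbeta.com/articles/385387.htm").content)
# Right-click the article node in Chrome -> Copy -> Copy XPath, then paste it here:
nodes = tree.xpath('//*[@id="news_content"]')   # hypothetical XPath
body = nodes[0].text_content().strip() if nodes else None
print(body)
```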