Boilerpipe (Boilerplate Removal and Full-Text Extraction from HTML Pages): Source Code Analysis

Boilerpipe (1.1.0), http://code.google.com/p/boilerpipe/

Example:
URL url = new URL("http://www.example.com/some-location/index.html");
// Note: use ArticleExtractor unless DefaultExtractor gives better results for you
String text = ArticleExtractor.INSTANCE.getText(url);
The analysis starts from ArticleExtractor. This class uses the Singleton design pattern; the unique instance is obtained via ArticleExtractor.INSTANCE. The actual processing is as follows.

HTML Parser
The HTML parser is based upon CyberNeko 1.9.13. It is called internally from within the extractors.
The parser takes an HTML document and transforms it into a TextDocument, consisting of one or more TextBlocks. It knows about specific HTML elements (script, option, etc.) that are ignored automatically.
Each TextBlock stores a portion of text from the HTML document. Initially (after parsing), almost every TextBlock represents a text section from the HTML document, except for a few inline elements that do not separate per definition (for example, '<a>' anchor tags).
The TextBlock objects also store shallow text statistics for the block's content, such as the number of words and the number of words in anchor text.
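These per-block statistics can be pictured with a minimal sketch. The field names below mirror the real de.l3s.boilerpipe.document.TextBlock, but this is a simplified stand-in, not the actual class:

```java
// Simplified stand-in for boilerpipe's TextBlock: just the statistics
// the downstream filters rely on (word counts, anchor words, block offsets).
public class TextBlockSketch {
    final String text;
    final int numWords;
    final int numWordsInAnchorText;
    int offsetBlocksStart, offsetBlocksEnd; // fused blocks span an offset range
    boolean isContent = false;

    TextBlockSketch(String text, int numWords, int numWordsInAnchorText, int offsetBlocks) {
        this.text = text;
        this.numWords = numWords;
        this.numWordsInAnchorText = numWordsInAnchorText;
        this.offsetBlocksStart = offsetBlocks;
        this.offsetBlocksEnd = offsetBlocks;
    }

    // Link density: the fraction of the block's words that sit inside anchor tags.
    double getLinkDensity() {
        return numWords == 0 ? 0 : (double) numWordsInAnchorText / numWords;
    }
}
```

The link density computed here is one of the two shallow features the classifiers described below branch on.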

Extractors
Extractors consist of one or more pipelined filters. They are used to get the content of a web page. Several different extractors exist, ranging from a generic DefaultExtractor to extractors specific to news article extraction (ArticleExtractor).
ArticleExtractor.process() chains this filter pipeline. The design is very extensible: the whole process is divided into small steps, and the processing stream is assembled like building blocks. To extend or change the process, you only need to add or replace one of the filters.
This is also convenient for multi-language extension; for example, the English package provides the corresponding filters:
import de.l3s.boilerpipe.filters.english.IgnoreBlocksAfterContentFilter;
import de.l3s.boilerpipe.filters.english.KeepLargestFulltextBlockFilter;
To extend to another language such as Korean, you only need to add a Korean package under the filters package, implement these filters for that language, and then change the imports to switch on Korean support.
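The building-block design can be sketched as a tiny pipeline over a simplified filter interface. The real de.l3s.boilerpipe.BoilerpipeFilter operates on a TextDocument; the List&lt;String&gt; model here is an assumption for brevity:

```java
import java.util.List;

// Simplified pipeline sketch: each filter reports whether it changed the
// document, and a pipeline is itself just a filter that runs its stages in order.
interface Filter {
    boolean process(List<String> doc);
}

public class Pipeline implements Filter {
    private final Filter[] stages;

    public Pipeline(Filter... stages) { this.stages = stages; }

    public boolean process(List<String> doc) {
        boolean changed = false;
        for (Filter f : stages) {
            changed |= f.process(doc); // later stages see earlier stages' edits
        }
        return changed;
    }
}
```

Swapping in, say, Korean-specific filters then means constructing the pipeline from different stages; nothing else changes.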

TerminatingBlocksFinder.INSTANCE.process(doc)
| new DocumentTitleMatchClassifier(doc.getTitle()).process(doc)
| NumWordsRulesClassifier.INSTANCE.process(doc)
| IgnoreBlocksAfterContentFilter.DEFAULT_INSTANCE.process(doc)
| BlockProximityFusion.MAX_DISTANCE_1.process(doc)
| BoilerplateBlockFilter.INSTANCE.process(doc)
| BlockProximityFusion.MAX_DISTANCE_1_CONTENT_ONLY.process(doc)
| KeepLargestFulltextBlockFilter.INSTANCE.process(doc)
| ExpandTitleToContentFilter.INSTANCE.process(doc);
Next, let's take a look at each step of the pipeline.

TerminatingBlocksFinder
Finds blocks which are potentially indicating the end of an article text and marks them with {@link DefaultLabels#INDICATES_END_OF_TEXT}. This can be used in conjunction with a downstream {@link IgnoreBlocksAfterContentFilter} (meaning IgnoreBlocksAfterContentFilter must be used as its downstream).

The principle is very simple: when tb.getNumWords() < 20, check whether the block's text satisfies any of the following conditions:
text.startsWith("Comments")
|| N_COMMENTS.matcher(text).find() // N_COMMENTS = Pattern.compile("(?msi)^[0-9]+ (Comments|users responded in)")
|| text.contains("What you think...")
|| text.contains("Add your comment")
|| text.contains("Add Comment")
|| text.contains("Reader views")
|| text.contains("Have your say")
|| text.contains("Reader Comments")
|| text.equals("Thanks for your comments - this feedback is now closed")
|| text.startsWith("Reuters")
|| text.startsWith("Please rate this")
If any condition holds, the block is considered the end of the article and is marked with tb.addLabel(DefaultLabels.INDICATES_END_OF_TEXT);
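As a compact sketch, the check amounts to the following (condition list abbreviated; the real filter also inspects the other phrases listed above):

```java
import java.util.regex.Pattern;

// Abbreviated version of TerminatingBlocksFinder's test: a short block whose
// text looks like a comments/rating footer signals the end of the article.
public class TerminatingCheck {
    private static final Pattern N_COMMENTS =
            Pattern.compile("(?msi)^[0-9]+ (Comments|users responded in)");

    static boolean isTerminating(String text, int numWords) {
        if (numWords >= 20) return false; // only short blocks are candidates
        return text.startsWith("Comments")
                || N_COMMENTS.matcher(text).find()
                || text.contains("Add your comment")
                || text.contains("Add Comment")
                || text.contains("Have your say")
                || text.startsWith("Please rate this");
    }
}
```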

DocumentTitleMatchClassifier
This one is very simple: it marks the position of the title on the page based on the content of '<title>'. The approach is to generate a list of potential titles from the '<title>' content, then match blocks against them and label matches with DefaultLabels.TITLE.

NumWordsRulesClassifier
Classifies {@link TextBlock}s as content/not-content through rules that have been determined using the C4.8 machine learning algorithm, as described in the paper "Boilerplate Detection Using Shallow Text Features" (WSDM 2010), particularly using the number of words per block and the link density per block.
This module implements a classifier to distinguish content from non-content. For the classifier's construction, see Section 4.3 of the paper.
The classifier uses a decision-tree algorithm trained on the labeled GoogleNews data set; the tree is then pruned. Applying reduced-error pruning, the authors were able to simplify the decision tree to only 6 dimensions (2 features each for the current, previous and next block) without a significant loss in accuracy.
Finally, the decision process is given as pseudo-code. This is the biggest benefit of decision trees: the decision rules are human-readable, so they can be expressed in any language.
This module implements Algorithm 2 from the paper (classifier based on number of words):

curr_linkDensity <= 0.333333
| prev_linkDensity <= 0.555556
| | curr_numWords <= 16
| | | next_numWords <= 15
| | | | prev_numWords <= 4: BOILERPLATE
| | | | prev_numWords > 4: CONTENT
| | | next_numWords > 15: CONTENT
| | curr_numWords > 16: CONTENT
| prev_linkDensity > 0.555556
| | curr_numWords <= 40
| | | next_numWords <= 17: BOILERPLATE
| | | next_numWords > 17: CONTENT
| | curr_numWords > 40: CONTENT
curr_linkDensity > 0.333333: BOILERPLATE
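Algorithm 2 transcribes directly into code. The flat signature below is an assumption for readability; the real NumWordsRulesClassifier reads these values from the previous/current/next TextBlock triple:

```java
// Direct transcription of the decision tree above (Algorithm 2 of the paper).
public class NumWordsRules {
    static boolean isContent(int prevNumWords, double prevLinkDensity,
                             int currNumWords, double currLinkDensity,
                             int nextNumWords) {
        if (currLinkDensity > 0.333333) {
            return false;                         // link-heavy block: boilerplate
        }
        if (prevLinkDensity <= 0.555556) {
            if (currNumWords <= 16) {
                if (nextNumWords <= 15) {
                    return prevNumWords > 4;      // short block in a short neighborhood
                }
                return true;
            }
            return true;                          // 17+ words: content
        }
        if (currNumWords <= 40) {
            return nextNumWords > 17;
        }
        return true;
    }
}
```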

With the classifier in place, the next step is to classify and label all the blocks.

IgnoreBlocksAfterContentFilter
Marks all blocks as "non-content" that occur after blocks that have been marked {@link DefaultLabels#INDICATES_END_OF_TEXT}. These marks are ignored unless a minimum number of words in content blocks occur before this mark (default: 60). This can be used in conjunction with an upstream {@link TerminatingBlocksFinder}.

This module is the downstream of the TerminatingBlocksFinder module, i.e., it must run after it. The work is simple: find the block labeled DefaultLabels.INDICATES_END_OF_TEXT and mark everything after it as boilerplate.
In addition, as long as the content seen so far is shorter than the minimum number of words (default: 60), the filter keeps accumulating text and ignores the end-of-text mark.
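A sketch of that behavior over a stripped-down block model (the field names are assumptions, not the real API):

```java
import java.util.List;

// Once enough content words have been seen (minWords, default 60) and a block
// carries the end-of-text mark, that block and everything after it is demoted.
public class IgnoreAfterContent {
    static class Blk {
        int numWords;
        boolean isContent;
        boolean endOfText; // stands in for DefaultLabels.INDICATES_END_OF_TEXT
    }

    static void process(List<Blk> doc, int minWords) {
        int words = 0;
        boolean foundEnd = false;
        for (Blk b : doc) {
            if (b.endOfText && words >= minWords) {
                foundEnd = true;
            }
            if (foundEnd) {
                b.isContent = false;  // demote the terminator and all that follows
            } else if (b.isContent) {
                words += b.numWords;  // keep counting until the threshold is met
            }
        }
    }
}
```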

BlockProximityFusion
Fuses adjacent blocks if their distance (in blocks) does not exceed a certain limit. This probably makes sense only in cases where an upstream filter already has removed some blocks.
This module merges blocks. The merge criterion is that the gap between the offsets of two adjacent blocks does not exceed maxBlocksDistance; for MAX_DISTANCE_1, at most one removed block may separate them:
int diffBlocks = block.getOffsetBlocksStart() - prevBlock.getOffsetBlocksEnd() - 1;
if (diffBlocks <= maxBlocksDistance) { /* fuse */ }
If contentOnly is set, fusion is performed only when both blocks are marked as content.

So where does the block offset come from? Look at the code where blocks are constructed.

BoilerpipeHTMLContentHandler.flushBlock():
TextBlock tb = new TextBlock(textBuffer.toString().trim(), currentContainedTextElements, numWords, numWordsInAnchorText, numWordsInWrappedLines, numWrappedLines, offsetBlocks);
offsetBlocks++;

TextBlock constructor:
this.offsetBlocksStart = offsetBlocks;
this.offsetBlocksEnd = offsetBlocks;

It can be seen that initially the block offsets simply increase monotonically, and as long as no fusion has happened, offsetBlocksStart and offsetBlocksEnd are equal.
So, as the class comment says, the merge criterion of this module is only meaningful when an upstream filter has removed some blocks; otherwise, with nothing deleted, every pair of adjacent blocks satisfies the fusion condition.
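The offset bookkeeping and the fusion test can be sketched together. Blk is a simplified stand-in for TextBlock, keeping only the offset range and text; MAX_DISTANCE_1 corresponds to maxBlocksDistance = 1:

```java
import java.util.Iterator;
import java.util.List;

// Offset-based fusion: two adjacent surviving blocks are merged when the gap
// left behind by removed blocks does not exceed maxBlocksDistance.
public class ProximityFusion {
    static class Blk {
        int start, end;
        StringBuilder text;
        Blk(int offset, String t) { start = end = offset; text = new StringBuilder(t); }
        void mergeWith(Blk other) {               // absorb the following block
            text.append(' ').append(other.text);
            end = other.end;
        }
    }

    static void process(List<Blk> blocks, int maxBlocksDistance) {
        Iterator<Blk> it = blocks.iterator();
        if (!it.hasNext()) return;
        Blk prev = it.next();
        while (it.hasNext()) {
            Blk cur = it.next();
            // gap left behind by blocks an upstream filter removed
            int diffBlocks = cur.start - prev.end - 1;
            if (diffBlocks <= maxBlocksDistance) {
                prev.mergeWith(cur);
                it.remove();
            } else {
                prev = cur;
            }
        }
    }
}
```

With freshly parsed blocks (offsets 0, 1, 2, ...) every diffBlocks is 0, which is exactly why the criterion only bites after an upstream filter has removed blocks.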

After reading this code I was quite surprised: fusion in the paper is based on text density, but here it is based only on the block offset, which weakens it. From the paper:
"There, adjacent text fragments of similar text density (interpreted as 'similar class') are iteratively fused until the blocks' densities (and therefore the text classes) are distinctive enough."
What I do not understand is how this module is used in ArticleExtractor:
BlockProximityFusion.MAX_DISTANCE_1.process(doc)
| BoilerplateBlockFilter.INSTANCE.process(doc)
| BlockProximityFusion.MAX_DISTANCE_1_CONTENT_ONLY.process(doc)
BlockProximityFusion is called twice, upstream and downstream of BoilerplateBlockFilter (described in the next section). The MAX_DISTANCE_1_CONTENT_ONLY call I can understand: after the non-content blocks are deleted, the remaining blocks are fused — for example, when an advertisement originally sat between two content blocks. However, this criterion is based on offsets rather than text density, which I personally think weakens it.
As for the MAX_DISTANCE_1 call, perhaps I have misread it, but I cannot understand why this step exists. The only explanation is that the author wants to fuse some blocks not labeled as content into the content. The strange thing is that the fusion here is unconditional (no blocks have been deleted yet, so the offset test always passes): the current block merely has to be content to be fused with its predecessor. And why check only whether the current and previous blocks are content before fusing? I personally find this logic unreasonable...

BoilerplateBlockFilter
Removes {@link TextBlock}s which have explicitly been marked as "not content".
To put it bluntly: traverse all blocks and delete every one not marked as content.

KeepLargestFulltextBlockFilter
Keeps the largest {@link TextBlock} only (by the number of words). In case of more than one block with the same number of words, the first block is chosen. All discarded blocks are marked "not content" and flagged as {@link DefaultLabels#MIGHT_BE_CONTENT}.
It is easy to understand: find the largest text block, keep it as the body, and mark all the discarded blocks with DefaultLabels.MIGHT_BE_CONTENT.
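Sketched over a minimal block model (an assumption, not the real TextBlock API):

```java
import java.util.List;

// Keep only the block with the most words; ties keep the first occurrence
// (strict '>' comparison). Everything else is demoted to MIGHT_BE_CONTENT.
public class KeepLargest {
    static class B {
        int numWords;
        boolean content = true;
        boolean mightBeContent = false;
    }

    static void process(List<B> doc) {
        B largest = null;
        for (B b : doc) {
            if (b.content && (largest == null || b.numWords > largest.numWords)) {
                largest = b;
            }
        }
        for (B b : doc) {
            if (b == largest) continue;
            b.content = false;
            b.mightBeContent = true;   // kept around for ExpandTitleToContentFilter
        }
    }
}
```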

ExpandTitleToContentFilter
Marks all {@link TextBlock}s "content" which are between the headline and the part that has already been marked content, if they are marked {@link DefaultLabels#MIGHT_BE_CONTENT}. This filter is quite specific to the news domain.
The logic is to find the block labeled DefaultLabels.TITLE and the first block marked as content, and then mark every MIGHT_BE_CONTENT block between them as content.
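A sketch of the promotion step (simplified model with assumed field names):

```java
import java.util.List;

// Between the title block and the first content block, promote every
// MIGHT_BE_CONTENT candidate to content.
public class ExpandTitle {
    static class B {
        boolean title, content, mightBeContent;
    }

    static void process(List<B> doc) {
        int titleIdx = -1, contentStart = -1;
        for (int i = 0; i < doc.size(); i++) {
            if (titleIdx == -1 && doc.get(i).title) titleIdx = i;
            if (contentStart == -1 && doc.get(i).content) contentStart = i;
        }
        if (titleIdx == -1 || contentStart <= titleIdx) return; // nothing between them
        for (int i = titleIdx; i < contentStart; i++) {
            B b = doc.get(i);
            if (b.mightBeContent) b.content = true; // promote candidate blocks
        }
    }
}
```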

TextDocument.getContent()
The last step outputs the extracted content as text: traverse every block labeled as content and append its text to the output.

DefaultExtractor
Above we looked at ArticleExtractor (tuned for news articles); for generic pages the DefaultExtractor is commonly used:
SimpleBlockFusionProcessor.INSTANCE.process(doc)
| BlockProximityFusion.MAX_DISTANCE_1.process(doc)
| DensityRulesClassifier.INSTANCE.process(doc);
It is relatively simple, just three steps. The second step is strange: since no upstream filter here marks blocks as content, that step does nothing.

SimpleBlockFusionProcessor
Merges two subsequent blocks if their text densities are equal.
Traverse the blocks; whenever two adjacent blocks have the same text density, merge them.
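The merge rule is a one-liner; a sketch over a simplified block model (the real processor merges full TextBlock objects and recomputes their statistics):

```java
import java.util.Iterator;
import java.util.List;

// Merge runs of adjacent blocks whose text densities are exactly equal.
public class SimpleFusion {
    static class Blk {
        double textDensity;
        String text;
        Blk(double d, String t) { textDensity = d; text = t; }
    }

    static void process(List<Blk> blocks) {
        Iterator<Blk> it = blocks.iterator();
        if (!it.hasNext()) return;
        Blk prev = it.next();
        while (it.hasNext()) {
            Blk cur = it.next();
            if (cur.textDensity == prev.textDensity) { // equal densities: fuse
                prev.text = prev.text + " " + cur.text;
                it.remove();
            } else {
                prev = cur;
            }
        }
    }
}
```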

DensityRulesClassifier
Classifies {@link TextBlock}s as content/not-content through rules that have been determined using the C4.8 machine learning algorithm, as described in the paper "Boilerplate Detection Using Shallow Text Features" (WSDM 2010), particularly using text densities and link densities.
Compare with NumWordsRulesClassifier; this one implements Algorithm 1 from the paper (Densitometric Classifier):
curr_linkDensity <= 0.333333
| prev_linkDensity <= 0.555556
| | curr_textDensity <= 9
| | | next_textDensity <= 10
| | | | prev_textDensity <= 4: BOILERPLATE
| | | | prev_textDensity > 4: CONTENT
| | | next_textDensity > 10: CONTENT
| | curr_textDensity > 9
| | | next_textDensity = 0: BOILERPLATE
| | | next_textDensity > 0: CONTENT
| prev_linkDensity > 0.555556
| | next_textDensity <= 11: BOILERPLATE
| | next_textDensity > 11: CONTENT
curr_linkDensity > 0.333333: BOILERPLATE
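As with Algorithm 2, this tree transcribes directly. The flat signature is again an assumption; the real DensityRulesClassifier reads these values from the previous/current/next TextBlock triple:

```java
// Direct transcription of the densitometric decision tree above (Algorithm 1).
public class DensityRules {
    static boolean isContent(double prevTextDensity, double prevLinkDensity,
                             double currTextDensity, double currLinkDensity,
                             double nextTextDensity) {
        if (currLinkDensity > 0.333333) {
            return false;                      // link-heavy block: boilerplate
        }
        if (prevLinkDensity <= 0.555556) {
            if (currTextDensity <= 9) {
                if (nextTextDensity <= 10) {
                    return prevTextDensity > 4;
                }
                return true;
            }
            return nextTextDensity != 0;       // dense block: content unless next is empty
        }
        return nextTextDensity > 11;
    }
}
```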

If you are interested, you can study the other extractors or design your own.
