A simple solution for extracting body text from an HTML file: test results


For impatient readers, the conclusions up front:

The author eventually stopped the experiments and abandoned the neural-network solution: judging these feature values directly against thresholds, with filter rules for a few special cases, appears both simpler and more effective than the neural networks tried here... For those who still want a neural-network solution, a promising option may be:
using text length, text length / link count, and the previous line's result as feature values, combined by Ada-boost over these three weak classifiers.
In addition, how "body text" is defined has a large effect on the results. If each category of page can first be defined by what it is, classification within that category may become predictable, rather than relying on directly designed threshold processing.

I. Introduction
This article builds on AlexJC's <The Easy Way to Extract Useful Text from Arbitrary HTML>. See the original:
http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/
-- AlexJC's original article

http://blog.csdn.net/lanphaday/archive/2007/08/13/1741185.aspx
-- Chinese-English translation by Flower Butterfly

The main idea of the original article is that body text, unlike other text on a page, has a characteristically high ratio of text to the HTML bytecode needed to produce it, and that a neural network can exploit this regularity to identify body text and filter out ads. The main design is as follows:

1. Parse the HTML code, keeping track of the number of bytes processed.
2. Save the parsed output text in the form of lines or paragraphs.
3. Count the number of bytes of HTML code corresponding to each line of text.
4. Compute the text density of each line as the ratio of text length to HTML byte count.
5. Finally, use a neural network to decide whether each line is part of the body text.
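Steps 1-4 above can be sketched roughly as follows. This is a minimal illustration, not the original implementation: the class and method names are assumptions, tags are stripped with a crude regex rather than a real parser, and character counts stand in for byte counts (equivalent for ASCII pages).

```java
// Minimal sketch of steps 1-4: split the HTML into lines, strip tags from
// each line, and compute text density = (text length) / (HTML length).
// A real parser would track tag state across lines; this is illustrative only.
public class DensityScanner {
    public static double[] densities(String html) {
        String[] lines = html.split("\n");
        double[] out = new double[lines.length];
        for (int i = 0; i < lines.length; i++) {
            int htmlBytes = lines[i].length();            // step 1/3: bytes of HTML per line
            String text = lines[i].replaceAll("<[^>]*>", "").trim(); // step 2: extracted text
            out[i] = htmlBytes == 0 ? 0.0
                    : (double) text.length() / htmlBytes; // step 4: text density
        }
        return out;
    }
}
```

Each line's density then becomes the input handed to the classifier in step 5.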


II. Design plan
This experiment makes several modifications to AlexJC's original design:

1. An Rprop (resilient backpropagation) network replaces the original perceptron;
2. Because the original article did not normalize text length and HTML byte length, the raw text length and HTML byte length are not used as feature values here. Instead, normalized text length, normalized HTML byte length, and various combinations with the n preceding and following lines are tested.
3. The test corpus is 10 arbitrarily selected web pages from the Internet; see the annex.
4. The original article does not say how to decide whether a single line belongs to the body text, so several page types are defined here:
a) Content type: characterized by long, continuous text segments; these segments are defined as the body text;
b) Forum type: short, discontinuous paragraphs of text; these paragraphs are defined as the body text;
c) Forum post-list type (part of the experiment includes this type in training to see the effect; whether a post list should count as body text is not discussed here...): the post titles are the body text;
d) Home-page type: defined as having no body text (well, who can say which part of the Sina home page is the body text?).

Experimental environment:
1. Language: Java, JRE 1.5
2. Operating system: Windows XP

III. Experimental process
1. Design and implement a three-layer Rprop network (surprisingly, nobody seems to have written an open-source component for this; even open-source giants such as Apache show little interest in neural networks).

/*
 * Initialize an Rprop object
 *
 * This constructor creates an Rprop object before training.
 * Parameters:
 * int in_num            number of input-layer nodes
 * int hidden_unit_num   number of hidden-layer nodes
 * int out_num           number of output-layer nodes
 */
public Rprop(int in_num, int hidden_unit_num, int out_num)

/*
 * Initialize an Rprop object
 *
 * This constructor creates an Rprop object from weights obtained after training.
 * Parameters:
 * int in_num            number of input-layer nodes
 * int hidden_unit_num   number of hidden-layer nodes
 * int out_num           number of output-layer nodes
 * double[][] w1         hidden-layer weights
 * double[][] w2         output-layer weights
 * double[] b1           hidden-layer biases
 * double[] b2           output-layer biases
 */
public Rprop(int in_num, int hidden_unit_num, int out_num, double[][] w1, double[][] w2, double[] b1, double[] b2)

/*
 * Compute the network output
 *
 * Parameters:
 * double[] p    input vector
 * Return value:
 * double[]      output vector
 */
public double[] output(double[] p)

/*
 * Train the network
 *
 * Parameters:
 * double[][] p       training sample set
 * double[][][] t     desired result set; t[i][j][0] is the expected result,
 *                    t[i][j][1] is the error amplification factor
 * double goal        target error; note that this network uses the variance
 *                    as the error criterion
 * int epochs         maximum number of training iterations
 */
public void train(double[][] p, double[][][] t, double goal, int epochs)
For the implementation, interested readers can download the attachment at the end of this article.
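The defining feature of Rprop, as opposed to plain backpropagation, is that the step size for each weight adapts to the sign of the gradient rather than its magnitude. The full three-layer network is in the attachment; the sketch below shows only this core update rule for a single weight. The constants 1.2 / 0.5 are the standard eta+/eta- values; all names here are the author of this sketch's assumptions, not the attachment's API.

```java
// Sketch of the core Rprop ("elastic" backpropagation) step-size rule:
// grow the step while the gradient keeps its sign, shrink it on a sign flip,
// and move each weight against the sign of its gradient (Rprop- variant).
public class RpropStep {
    static final double ETA_PLUS = 1.2, ETA_MINUS = 0.5;
    static final double DELTA_MAX = 50.0, DELTA_MIN = 1e-6;

    // New per-weight step size, given current and previous partial derivatives.
    public static double updateDelta(double grad, double prevGrad, double delta) {
        double s = grad * prevGrad;
        if (s > 0) return Math.min(delta * ETA_PLUS, DELTA_MAX);  // same sign: accelerate
        if (s < 0) return Math.max(delta * ETA_MINUS, DELTA_MIN); // sign flip: back off
        return delta;                                             // zero: keep step
    }

    // One weight update: only the gradient's sign matters, not its magnitude.
    public static double updateWeight(double w, double grad, double delta) {
        return w - Math.signum(grad) * delta;
    }
}
```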

2. Feature value selection
In the experiment, the author tried various combinations of feature values:
1) text density, text length, HTML bytecode length, plus the same values for the preceding and following lines (the original article's set);
2) text density, reciprocal of text length (normalized), plus the same values for the two preceding and two following lines;
3) the page's link density (total text length / total number of links, used to help judge the page type), text density, text length / 5000 (normalized, clamped to 1 when greater than 1; hereinafter "text length 2"), plus the same values for the two preceding and two following lines;
4) the page's link density, text density, text length 2, plus the same values for the two preceding and two following lines;
5) the page's link density, text density, text length 2, plus the same values for the preceding and following lines;
6) the page's link density, text density, text length 2, and whether the previous line is body text;

The network output is defined as 0 for non-body lines and 1 for body-text lines.
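As an illustration, the per-line feature vector for combination 3) might be computed as follows. The 5000 normalization constant and the link-density definition come from the list above; the method and parameter names are assumptions for this sketch (and the neighbor-line values would be appended by the caller).

```java
public class Features {
    // Feature vector for one line under combination 3):
    // [page link density, line text density, normalized text length].
    public static double[] lineFeatures(int pageTextLen, int pageLinkCount,
                                        int lineTextLen, int lineHtmlBytes) {
        double linkDensity = pageLinkCount == 0 ? pageTextLen
                : (double) pageTextLen / pageLinkCount;   // total text / total links
        double textDensity = lineHtmlBytes == 0 ? 0.0
                : (double) lineTextLen / lineHtmlBytes;   // text per HTML byte
        double normLen = Math.min(lineTextLen / 5000.0, 1.0); // "text length 2", clamped to 1
        return new double[] { linkDensity, textDensity, normLen };
    }
}
```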

During training, it was found that the network mostly fit the 0-valued lines. This is because forum-type pages, with their many short sections, produce far too many 0-valued lines, so training overfits to the 0 values. To avoid this, the error of each line on a page is multiplied by the ratio of the page's number of 0-valued lines to its number of 1-valued lines.
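The per-page amplification factor described above (presumably fed in as t[i][j][1] of the train method) can be sketched like this; which lines the factor is applied to is an assumption of this sketch, as the text only gives the ratio itself.

```java
public class ErrorWeight {
    // Per-page error amplification: the ratio of 0-valued lines to 1-valued
    // lines, used to keep pages dominated by 0-lines from swamping training.
    public static double amplification(int[] targets) { // 0/1 desired output per line
        int zeros = 0, ones = 0;
        for (int t : targets) {
            if (t == 0) zeros++; else ones++;
        }
        return ones == 0 ? 1.0 : (double) zeros / ones; // guard: page with no body lines
    }
}
```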

3. Training set
See the annex. It consists of 10 pages randomly taken from web pages the author often browses. For the definition of the desired output, see above.

IV. Experimental results
1. In experiments 1-5, with part of the sample set arbitrarily taken as the training set, the fit on the training set was very good, but performance on the test set was very poor (the author regrets not having recorded the experimental data);

This part of the results shows that text density is a problematic feature value for judging whether a line is body text. Inspecting the sample data shows that even large paragraphs of content can have very low text density: to make pages look attractive, many sites now wrap large sections of text content in heavy HTML code...
In view of this, the author finally abandoned text density as a feature value. And considering that ads come bundled with links while body text has relatively few links, text length / link count may be a better choice of feature value.

2. Experiment 6 performed unexpectedly well (so well that the author almost thought the perfect solution had finally been found...).
Indeed, even on the test set its performance was surprisingly good, but there is a real problem: each line's classification depends on the result for the previous line. In the test set, the previous line's result was defined in advance for each line; in actual use, however, the previous line's result is computed in real time, so a single wrong line can cause all subsequent lines to be wrong...

At this point, for those who still want a neural-network solution, a promising option may be:
using text length, text length / link count, and the previous line's result as feature values, combined by Ada-boost over these three weak classifiers.
In addition, how "body text" is defined has a large effect on the results. If each category of page can first be defined by what it is, classification within that category may become predictable, rather than relying on directly designed threshold processing.
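The suggested Ada-boost combination step could look roughly like the sketch below: three weak threshold classifiers (text length, text length / link count, previous line's result) vote with weights alpha that Ada-boost would derive from each classifier's training error. Everything here (names, thresholds, alphas) is an illustrative placeholder; the training of the alphas themselves is not shown.

```java
public class AdaVote {
    // Weighted vote of threshold-based weak classifiers, as in Ada-boost.
    // features[i] is weak classifier i's input; it votes +1 if the input
    // reaches thresholds[i], else -1, and the votes are weighted by alphas[i].
    public static int classify(double[] features, double[] thresholds, double[] alphas) {
        double sum = 0.0;
        for (int i = 0; i < features.length; i++) {
            int h = features[i] >= thresholds[i] ? 1 : -1; // weak classifier i
            sum += alphas[i] * h;
        }
        return sum >= 0 ? 1 : 0; // 1 = body text, 0 = not body text
    }
}
```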

The author stopped the experiments here and abandoned the neural-network solution: judging these feature values directly against thresholds, with filter rules for a few special cases, appears both simpler and more effective than the neural networks tried here...
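The threshold approach the author settled on might look something like this minimal sketch. The specific threshold values and the choice of features are placeholders, not the ones actually tuned in the experiment, and the special-case filter rules mentioned above are omitted.

```java
public class RuleClassifier {
    // Illustrative threshold rule: a line counts as body text when it is long
    // enough and the page as a whole is text-heavy relative to its link count.
    static final int MIN_LINE_LEN = 20;          // placeholder threshold
    static final double MIN_LINK_DENSITY = 30.0; // placeholder: page text chars per link

    public static boolean isBody(int lineTextLen, double pageLinkDensity) {
        return lineTextLen >= MIN_LINE_LEN && pageLinkDensity >= MIN_LINK_DENSITY;
    }
}
```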

If any readers are interested in carrying the Ada-boost experiment forward, the author would very much look forward to exchanging experiences with them.
